Advanced Statistical
Methods in


Biometric Research 


WILEY PUBLICATIONS
IN STATISTICS

Walter A. Shewhart, Editor

Mathematical Statistics
RAO—Advanced Statistical Methods in Biometric Research.
KEMPTHORNE—The Design and Analysis of Experiments.
DWYER—Linear Computations.
FISHER—Contributions to Mathematical Statistics.
WALD—Statistical Decision Functions.
FELLER—An Introduction to Probability Theory and Its Applications, Volume One.
WALD—Sequential Analysis.
HOEL—Introduction to Mathematical Statistics.

Applied Statistics
GOULDEN—Methods of Statistical Analysis (in press).
HALD—Statistical Theory with Engineering Applications.
HALD—Statistical Tables and Formulas.
YOUDEN—Statistical Methods for Chemists.
MUDGETT—Index Numbers.
TIPPETT—Technological Applications of Statistics.
DEMING—Some Theory of Sampling.
COCHRAN and COX—Experimental Designs.
RICE—Control Charts.
DODGE and ROMIG—Sampling Inspection Tables.

Related Books of Interest to Statisticians
HAUSER and LEONARD—Government Statistics for Business Use.


Advanced Statistical 
Methods in 


Biometric Research 


C. RADHAKRISHNA RAO 
Professor of Statistics 

Indian Statistical Institute 

Calcutta 


New York - John Wiley & Sons, Inc. 
London - Chapman & Hall, Limited 


Copyright, 1952
BY
John Wiley & Sons, Inc.


All Rights Reserved 


This book or any part thereof must not 
be reproduced in any form without 
the written permission of the publisher. 


Copyright, Canada, 1952, International Copyright, 1952
John Wiley & Sons, Inc., Proprietors


All Foreign Rights Reserved 
Reproduction in whole or in part forbidden. 




Library of Congress Catalog Card Number: 52-5325 


PRINTED IN THE UNITED STATES OF AMERICA 


In memory of 
my father 
the late C. D. Naidu



The statistician is no longer an alchemist expected to produce 
gold from any worthless material offered him. He is more 
like a chemist capable of assaying exactly how much of value 
it contains, and capable also of extracting this amount, and 
no more. In these circumstances, it would be foolish to 
commend a statistician because his results are precise or to 
reprove because they are not. If he is competent in his craft, 
the value of the result follows solely from the value of the ma- 
terial given him. It contains so much information and no 
more. His job is only to produce what it contains. 


R. A. FISHER 


Preface 


The ever-increasing need for more searching and finer analyses of 
statistical data in various domains of human activity has constantly 
been giving rise to new concepts and improved methods. The vast 
amount of research carried out during the past few decades in the field 
of theoretical and applied statistics has been responsible for the dis- 
carding or recasting of some of the older methods in statistics and for 
the creation of a wealth of new statistical tools for the research worker 
and the routine analyst. The popularity and range of application of 
any statistical method, however, has always remained in great measure 
dependent on the logical elucidation it received and the simplicity of 
procedure it was capable of, mainly because more often a method that 
is powerful is also complex in theory and procedure. 

The object in writing this book is to present a number of statistical 
techniques, keeping in view the requirements of both the student who 
questions the basis of a particular method employed and the practical 
worker who seeks a recipe for the reduction of his data. I have there- 
fore endeavored first to provide a theoretical groundwork for the differ- 
ent methods to satisfy the former and second to illustrate computational 
procedures by working out a number of problems in full to meet the 
demands of the latter. 

Throughout this book, efforts have been made to integrate a large 
collection of computational schemes into consistent patterns. Thus the 
problems of regression and analysis of variance and covariance reduce 
to fitting of constants by the method of least squares and evaluation 
of the least sum of squares. The problems of multivariate analysis 
resolve themselves into an analysis of the dispersion matrix and reduction
of determinants. The reduction of a matrix by the method of
pivotal condensation emerges as the most useful technique which at the
same time does not present any computational difficulty. The check
column, properly carried, ensures numerical accuracy. The different 
computational schemes have been illustrated in detail with the help of 
original data. These data have also been reproduced, either in full or, 
in the case of extensive data, in the form of necessary statistics such as 
totals, sums of squares, products, etc., giving reference to the sources
from which they are taken.


The material presented for illustrative purposes has been restricted 
mainly to anthropometric studies, but without prejudicing the statistical 
methods used in their applicability to problems in various other branches 
of knowledge such as general biology, psychology, economics, and other 
social sciences. The problem of neurotic groups considered in section 
9d.1 provides an example from psychology. For the psychological 
findings appearing in that section I am indebted to Mr. Patrick Slater. 

In the development of the text, the first chapter is devoted entirely 
to mathematical procedures in modern algebra. In the second chapter it 
is shown that most of the distribution problems connected with uni- 
variate and multivariate normal populations can be solved by a funda- 
mental theorem on least squares, whose proof needs only a knowledge of 
linear transformations. 

The third chapter first deals with applications of the least square technique
in the estimation of parameters. It next traces as the starting
point of exact sampling theory the fundamental discovery by W. S.
Gosset, now known as “studentization,” that the probability of the error
in the observed average expressed as a multiple of the sample standard
deviation admits a precise evaluation independently of the unknown
standard deviation of the population. Then it proceeds with tests of
linear hypotheses. The tests discussed in this chapter, exact for small
samples, are due to R. A. Fisher, who, having developed the theory of
these tests, also put forward the elegant computational scheme of the
analysis of variance table.

The fourth chapter contains some observations on the general theories 
of estimation and certain applications of the method of maximum 
likelihood. The scoring system of R. A. Fisher discussed and elaborated 
in this chapter introduces great simplicity and mechanization in the use 
of the maximum likelihood principle, thus providing a complete answer 
to critics who hold that the method of maximum likelihood leads to 
intractable equations. Certain alternative methods of deriving asymp- 
totically best estimates advocated by some authors neither have gen- 
eral applicability nor would admit mechanized computation. 

Problems of specification and associated tests of homogeneity form
the subject matter of the fifth chapter. The choice of a mathematical
model [...] be deemed to have arisen is of [...] subsequent statistical
computations [...] chosen model. The choice is [...] “but this empiricism
could be cleared of its dangers if we can apply a rigorous and objective
test of the adequacy with which the proposed population represents the
whole of the available facts.” Karl Pearson was the first to visualize this
approach. The χ² goodness of fit introduced by him has been found
extremely useful in many other directions too.

The next chapter gives tests of homogeneity of variances and co- 
variances, and these form the preliminary investigation in multivariate 
analysis which is detailed in the seventh chapter. The eighth and ninth 
chapters relate to the utilization of multiple measurements in problems 
of biological classification. Biologists are usually confronted with the 
problem of assigning an individual to one of several groups to which he 
might belong. An objective method which minimizes the errors of 
classification is provided in the eighth chapter, utilizing the modern 
theories of inference as developed by J. Neyman and A. Wald. Chapter 
9 is devoted to a study of the interrelationships between a number of 
populations or groups of individuals. The methods of this chapter are 
based on the researches carried out in the Indian Statistical Institute 
under the inspiring guidance of P. C. Mahalanobis, who introduced the
concept of group distance.

C. R. Rao 


Calcutta, India 
June 1952 


Contents


CHAPTER 1. ALGEBRA OF VECTORS AND MATRICES

1a  VECTOR SPACES
    1a.1 Vectors. 1a.2 Linear Independence and Orthogonality. 1a.3
    Vector Spaces and the Sweep-Out Method. 1a.4 The Orthogonal
    Vector Space and the Deficiency Matrix. 1a.5 Linear Equations.

1b  THEORY OF MATRICES AND DETERMINANTS
    1b.1 Matrices. 1b.2 Partitioned Matrices. 1b.3 Determinants.

1c  QUADRATIC FORMS
    1c.1 Definitions. 1c.2 Linear Transformations. 1c.3 Classification
    of Quadratic Forms. 1c.4 The Latent Roots of a Matrix and the
    Characteristic Vectors. 1c.5 Pairs of Quadratic Forms. 1c.6
    Reduction of an Asymmetric Matrix.

1d  NUMERICAL APPENDIX
    1d.1 The Evaluation of Determinants, Reciprocals, and Solutions of
    Equations.


CHAPTER 2. THEORY OF DISTRIBUTIONS

2a  SOME ANALYTICAL METHODS IN DISTRIBUTION PROBLEMS
    2a.1 Binomial Distribution. 2a.2 Multinomial Distribution. 2a.3
    The Poisson Distribution. 2a.4 Normal Distribution. 2a.5 Gamma
    Distribution. 2a.6 Beta Distribution. 2a.7 Cauchy Distribution.
    2a.8 Pearson’s P_λ Distribution. 2a.9 Summary of Results.

2b  DISTRIBUTIONS RELATING TO THE UNIVARIATE NORMAL DISTRIBUTION
    2b.1 Mean and Variance in Normal Samples. 2b.2 Student’s
    Distribution. 2b.3 Fisher’s z Distribution. 2b.4 Cochran’s Theorem.
    2b.5 Distribution of Non-Central χ².

2c  MULTIVARIATE NORMAL POPULATIONS
    2c.1 The Multivariate Normal Distribution. 2c.2 [...] of a Set of
    Linear Functions of Normal Variates. 2c.3 The Distribution of
    Quadratic Forms.

2d  LEAST SQUARES FUNDAMENTAL IN DISTRIBUTION THEORY
    2d.1 Two Theorems on Least Squares. 2d.2 Multivariate
    Distributions.


CHAPTER 3. THE THEORY OF LINEAR ESTIMATION
AND TESTS OF HYPOTHESES

3a  LINEAR ESTIMATION
    3a.1 Observational Equations. 3a.2 Best Unbiased Estimates.
    3a.3 The Necessary and Sufficient Condition for the Existence of an
    Unbiased Estimate. 3a.4 Normal Equations. 3a.5 Linear Functions
    with Zero Expectations. 3a.6 Standard Errors of Estimates and
    Intrinsic Properties of Normal Equations. 3a.7 Principle of
    Substitution. 3a.8 Observational Equations with Linear Restrictions
    on Parameters. 3a.9 Observational Equations with Correlated
    Variables.

3b  TESTS OF LINEAR HYPOTHESES
    3b.1 Nature of Linear Hypotheses. 3b.2 Test for H₀. 3b.3 Test
    for H₀ when R₀ Is Not True.

3c  THE COMBINATION OF WEIGHTED OBSERVATIONS
    3c.1 Transformation to Unweighted Observations. 3c.2 An Example
    of Weighted Observations.

3d  TESTS OF HYPOTHESES WITH A SINGLE DEGREE OF FREEDOM
    3d.1 Student’s t Test. 3d.2 Asymmetry of Right and Left Femora.

3e  ANALYSIS OF VARIANCE
    3e.1 One-Way Classification. 3e.2 Two-Way Classification with a
    Single Observation in a Cell. 3e.3 Two-Way Classification with
    Multiple but Equal Numbers in Cells. 3e.4 Two-Way Classification
    with Unequal Numbers in Cells.

3f  [...]
    3f.1 The Concept of Regression. 3f.2 Prediction of Cranial Capacity.
    3f.3 Test for the Equality of Regression Equations. 3f.4 The Test for
    an Assigned Regression Function.

3g  THE GENERAL PROBLEM OF LEAST SQUARES WITH TWO SETS OF PARAMETERS
    3g.1 Concomitant Variables. 3g.2 [...] Variation. 3g.3 An
    Illustrative Example. 3g.4 A Problem of [...]


CHAPTER 4. THE GENERAL THEORY OF ESTIMATION
AND THE METHOD OF MAXIMUM LIKELIHOOD

4a  BEST UNBIASED ESTIMATES
    4a.1 Estimation by Minimizing the Variance. 4a.2 The Information
    Limit to Variance: A Single Parameter. 4a.3 Distributions Admitting
    Estimates with the Information Limit to Variance. 4a.4 Sufficient
    Statistics and Unbiased Estimates. 4a.5 Distributions Admitting
    Sufficient Statistics. 4a.6 An Optimum Property of Sufficient
    Statistics. 4a.7 More Stringent Inequalities for the Variance of an
    Estimate. 4a.8 The Case of Several Parameters. 4a.9 Properties of
    Distributions Admitting Sufficient Statistics: Several Parameters.

4b  ESTIMATION BY THE METHOD OF MAXIMUM LIKELIHOOD
    4b.1 The Principle of Maximum Likelihood. 4b.2 Consistency and
    Bias. 4b.3 The Concept of Efficiency. 4b.4 Some Optimum
    Properties of Maximum Likelihood Estimates.

4c  SOME EXAMPLES OF MAXIMUM LIKELIHOOD ESTIMATES
    4c.1 Improved Estimates of Means from Incomplete Data on Several
    Variables. 4c.2 The Method of Scoring for the Estimation of
    Parameters. 4c.3 Combination of Data.

    APPENDIX: SOME LIMITING THEOREMS


CHAPTER 5. LARGE SAMPLE TESTS OF HYPOTHESES
WITH APPLICATIONS TO PROBLEMS
OF ESTIMATION

5a  THE GENERAL THEORY OF TESTS IN LARGE SAMPLES
    5a.1 The Nature of Statistical Hypotheses. 5a.2 The Problem of
    Distribution.

5b  APPLICATIONS OF THE GENERAL THEORY
    5b.1 The χ² Test of Departure from a Simple Hypothesis. 5b.2 The
    χ² Test of Goodness of Fit. 5b.3 Tests of Homogeneity of Parallel
    Samples.

5c  CONTINGENCY TABLES
    5c.1 The Probability of an Observed Configuration and Tests in Large
    Samples. 5c.2 Tests of Independence in a Contingency Table. 5c.3
    Tests of Independence in Small Samples.

5d  TESTS IN POISSON POPULATIONS

5e  TRANSFORMATION OF STATISTICS
    5e.1 A General Lemma. 5e.2 The Square Root Transformation of
    the Poisson Variate. 5e.3 The Sin⁻¹ Transformation of the Binomial
    Proportion. 5e.4 Other Useful Transformations.

5f  LARGE SAMPLE STANDARD ERRORS OF MOMENTS
    5f.1 Variances and Covariances of Raw Moment Statistics. 5f.2 Large
    Sample Tests of Difference between Means and an Illustration of the
    P_λ Test. 5f.3 Tests of Normality.


CHAPTER 6. TESTS OF HOMOGENEITY OF VARIANCES
AND CORRELATIONS

6a  HOMOGENEITY OF VARIANCES
    6a.1 Test for a Specified Variance. 6a.2 Test for a Specified Inequality
    of Two Estimated Variances. 6a.3 The Likelihood Criterion and Its
    Use. 6a.4 Practical Applications. 6a.5 Problems Requiring an
    Exact Treatment.

6b  HOMOGENEITY OF CORRELATIONS
    6b.1 Exact Test for Zero Correlation. 6b.2 Fisher’s Tanh⁻¹
    Transformation. 6b.3 Test for a Given ρ. 6b.4 Test for the Equality
    of Two Correlation Coefficients. 6b.5 Test for the Homogeneity of a
    Set of Correlation Coefficients. 6b.6 Correction for Bias in the Test
    for Homogeneity and the Best Estimate of ρ.


CHAPTER 7. TESTS OF SIGNIFICANCE IN MULTIVARIATE
ANALYSIS

7a  REVIEW OF WORK ON MULTIVARIATE ANALYSIS

7b  TESTS WITH DISCRIMINANT FUNCTIONS
    7b.1 Two Fundamental Distributions. 7b.2 Problems of a Single
    Sample. 7b.3 Mahalanobis’ D² and Problems of Two Samples. 7b.4
    Test for an Assigned Discriminant Function. 7b.5 Tests for
    Discriminant Function Coefficients. 7b.6 The Additional Information
    Supplied by Some Characters.

7c  GENERALIZATION OF D² AND THE LARGE SAMPLE THEORY FOR SEVERAL
    GROUPS

7d  TESTS WITH WILKS’S Λ CRITERION
    7d.1 Analysis of Dispersion [...] Criterion. 7d.2 The Distribution
    of Λ and Its Practical Use. 7d.3 Test of Differences in Mean Values
    for Several Populations. 7d.4 Internal Analysis of a Set of Variates.
    7d.5 Barnard’s Problem of Secular Variations in Skull Characters.


CHAPTER 8. STATISTICAL INFERENCE APPLIED TO
CLASSIFICATORY PROBLEMS

8a  [...]
    [...] 8a.2 Null Hypotheses. 8a.3 [...] 8a.4 Locally Most
    Powerful [...] 8a.5 Test for a Finite Number of Alternatives.
    8a.6 Tests When the Alternatives Are Continuous.

8b  PROBLEMS OF DISCRIMINATION
    8b.1 The General Problem. 8b.2 The Discriminant Function of
    R. A. Fisher. 8b.3 Some Difficulties in the Use of the Best
    Discriminating Solution. 8b.4 Uncertainty of the A Priori Information
    That One of the Alternatives Is Correct. 8b.5 The Doubtful Region.
    8b.6 Resolution of a Mixed Series into Two Gaussian Components.
    8b.7 Sexing of Osteometric Material. 8b.8 The Problem of Three and
    More Groups. 8b.9 Application to Multivariate Normal Populations.
    8b.10 Allocation of a Number of Individuals to Two or More Groups.

8c  DISCRIMINANT FUNCTION FOR SELECTING GENETICALLY DESIRABLE TYPES
    8c.1 Prediction Formula for the Genotypic Value. 8c.2 The Genetic
    Advance.

8d  PROBLEMS OF OPTIMUM SELECTION
    8d.1 A Single Predictor for Dichotomy. 8d.2 The Problem of
    Differential Predictors.

    APPENDIX A
    A1 A Lemma of Neyman and Pearson. A2 A Generalization of the
    Neyman-Pearson Lemma. A3 A Slight Variation of Lemma A1.
    A4 A Lemma on Power Functions. A5 Two Lemmas Useful in
    Classificatory Problems.

    APPENDIX B
    B1 On a Transformation Useful in [...] An Alternative
    Computational Scheme.


CHAPTER 9. THE CONCEPT OF DISTANCE AND THE
PROBLEM OF GROUP CONSTELLATIONS

9a  DISTANCE BETWEEN TWO POPULATIONS
    9a.1 The Need for a Distance Function. 9a.2 Mathematical Concepts
    (Discriminatory Topology). 9a.3 Mahalanobis’ Generalized
    Distance. 9a.4 Karl Pearson’s Coefficient of Racial Likeness.

9b  AN ILLUSTRATIVE EXAMPLE
    9b.1 Calculation of D². 9b.2 The Determination of Group
    Constellations.

9c  THE USE OF CANONICAL VARIATES IN DERIVING GROUP CONSTELLATIONS
    9c.1 Graphical Methods of Representing the Groups. 9c.2 The
    Problem of Maximal Average D². 9c.3 An Illustrative Example.

9d  A TEST FOR REDUCTION IN THE NUMBER OF DIMENSIONS
    9d.1 The Analysis of Neurotic Cases.

    INDEX


CHAPTER 1 


Algebra of Vectors and Matrices 


1a Vector Spaces


1a.1 Vectors

A set of ordered elements $(x_1, \cdots, x_n)$ is called a vector, which may
be simply denoted by $x$. The elements may be observations obtained
in the order $x_1, x_2, \cdots, x_n$ from a population, in which case $x$ is called
the vector of observations.

Suppose that these observations are standardized by multiplying each
of them by a constant $c$, otherwise known as a scalar. The resulting
vector is $(cx_1, \cdots, cx_n)$, which may be represented by $cx$. This is the
rule for multiplying a vector $x$ by a scalar $c$.

The weighted sum of observations with the vector $w$ of weights
$(w_1, \cdots, w_n)$ is

$$w_1x_1 + \cdots + w_nx_n$$

This may be represented by $w \cdot x$, denoting the product of two vectors
$w$ and $x$. The simple average of the observations is the result of multiplying
the observation vector $x$ by the weight vector $(1/n, \cdots, 1/n)$.
The sum of squares of observations is the result of multiplying $x$ by
itself, i.e., $x \cdot x = x^2 = x_1^2 + \cdots + x_n^2$.

If $x = (x_1, \cdots, x_p)$ is the vector of $p$ measurements on an individual
and $y = (y_1, \cdots, y_p)$ on another, then the sum of the measurements
is the vector $(x_1 + y_1, \cdots, x_p + y_p)$, which may be represented by
$x + y$. From the definitions given above we find the vector of average
measurements

$$\tfrac{1}{2}(x + y) = \left(\frac{x_1 + y_1}{2}, \cdots, \frac{x_p + y_p}{2}\right)$$

In general,

$$ax + by + cz = (ax_1 + by_1 + cz_1, \cdots, ax_p + by_p + cz_p)$$

where $a$, $b$, $c$ are scalars and $x$, $y$, $z$ are vectors. The vector $0 = (0, \cdots, 0)$
is called the null vector. It is easy to verify that any vector
added to the null one remains unchanged and any vector multiplied
by the null vector reduces to zero.
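
The rules of this section can be checked numerically. The following is a minimal Python sketch, not part of the original text; the helper names `scale`, `add`, and `dot` and the sample observations are illustrative assumptions.

```python
from fractions import Fraction

def scale(c, x):
    """Multiply the vector x by the scalar c."""
    return [c * xi for xi in x]

def add(x, y):
    """Add two vectors element by element."""
    return [xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """Product of two vectors: the sum of element-wise products."""
    return sum(xi * yi for xi, yi in zip(x, y))

# Hypothetical observations, kept as exact fractions.
x = [Fraction(v) for v in (2, 4, 6, 8)]
y = [Fraction(v) for v in (4, 0, 2, 6)]
n = len(x)

weights = [Fraction(1, n)] * n                  # weight vector (1/n, ..., 1/n)
mean = dot(weights, x)                          # simple average of x
sum_sq = dot(x, x)                              # sum of squares of x
avg_vec = scale(Fraction(1, 2), add(x, y))      # vector of average measurements
null = [Fraction(0)] * n                        # null vector

print(mean, sum_sq, avg_vec)
print(add(x, null) == x, dot(x, null) == 0)
```

Exact fractions are used here only so that the comparisons are not affected by rounding.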


1a.2 Linear Independence and Orthogonality 

A set of vectors $(x, y, \cdots)$ is said to be independent if none of them
can be expressed as a linear combination of the rest. For instance, the
vectors $(1, 0, 1, -1, 2)$, $(3, 2, -1, 1, 2)$, and $(9, 4, 1, -1, 10)$ are not
independent since the last vector is the sum of 3 times the first and 2
times the second. Two non-null vectors are said to be orthogonal if
their product is zero. A set of orthogonal vectors with real elements
is necessarily independent. If not, then

$$x = ay + bz + \cdots$$

Multiplying by $x$,

$$x \cdot x = a\,x \cdot y + b\,x \cdot z + \cdots = 0$$
$$x_1^2 + x_2^2 + \cdots = 0$$

which means that $x_1 = x_2 = \cdots = 0$ or $x$ is a null vector, contrary to
the assumption that the vectors are non-null.
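
As a quick check of the numerical example above, the following Python sketch (an illustration only; the function names and the pair `u`, `w` are assumptions introduced here) verifies the stated linear dependence and shows a pair of orthogonal vectors.

```python
def dot(x, y):
    """Product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_combination(coeffs, vectors):
    """Return coeffs[0]*vectors[0] + coeffs[1]*vectors[1] + ... as a tuple."""
    length = len(vectors[0])
    return tuple(sum(c * v[i] for c, v in zip(coeffs, vectors))
                 for i in range(length))

v1 = (1, 0, 1, -1, 2)
v2 = (3, 2, -1, 1, 2)
v3 = (9, 4, 1, -1, 10)

# The third vector is 3 times the first plus 2 times the second.
combo = linear_combination([3, 2], [v1, v2])
print(combo == v3)

# A hypothetical orthogonal pair: the product is zero although
# neither vector is null.
u = (1, 1, 0)
w = (1, -1, 2)
print(dot(u, w), dot(u, u), dot(w, w))
```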


1a.3 Vector Spaces and the Sweep-Out Method 


The totality of vectors obtained by linear combinations of a set of 
vectors is called a vector space. Such a totality can be generated by a 
set of independent vectors called a basis of the vector space. If the 
number of vectors in a basis is m, then the vector space is said to be 
of rank or dimensions $m$. In practical problems it is sometimes necessary
to obtain the rank of a vector space formed by the vectors

$$\begin{matrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\cdots & \cdots & \cdots & \cdots \\
a_{r1} & a_{r2} & \cdots & a_{rn}
\end{matrix}$$

The arrangement of the elements $a_{ij}$ in rows and columns as above is
known as a matrix. A convenient way of finding the rank is by a
method known as “sweep out,” consisting in the following operations:

(i) Any vector having a non-zero value for the first element is taken,
and all its elements are divided by the first so that the resulting vector
is of the form $(1, c_2, \cdots, c_n)$. If the elements of the first column are
all zero, then it is omitted to start with.


(ii) From every other vector is subtracted a vector obtained by multi- 
plying (1, c2, +-+, Cn) by the first element of the former vector so that 
the resulting vectors, except the one chosen in (i), have zero as their 
first element. The first column is said to be swept out by the vector 
called the pivotal row chosen above. 

(iii) Omission of the pivotal row and the first column results in a 
reduced matrix on which operations (i) and (ii) are repeated until a 
single non-zero row or all null rows are left over. A single non-zero row 
left over may be regarded as the last pivotal row, in which case the 
rank of the matrix or of the vector space is equal to the number of 
pivotal rows. An example is given below with four rows.


                   Column No.
Row No.    (1)    (2)    (3)    (4)     Operations on rows

  (1)       0      1      2      3
  (2)       2     -1      5      4
  (3)       4      0      6      1
  (4)       0     -2      4      7

  (5)       1    -0.5    2.5     2      (2) ÷ 2            1st pivotal row
  (6)              1      2      3      (1) − (5) × 0
  (7)              2     -4     -7      (3) − (5) × 4
  (8)             -2      4      7      (4) − (5) × 0

  (9)              1      2      3      (6) ÷ 1            2nd pivotal row
  (10)                   -8    -13      (7) − (9) × 2
  (11)                    8     13      (8) − (9) × (−2)

  (12)                    1    13/8     (10) ÷ (−8)        3rd pivotal row
  (13)                           0      (11) − (12) × 8


The rank of the vector space is, therefore, 3, the independent vectors
being the pivotal rows

   1   -0.5   2.5    2
   0    1     2      3
   0    0     1     13/8
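
The sweep-out steps (i) to (iii) can be sketched in Python as follows. This is an illustrative sketch rather than the author's own scheme: the function name `sweep_out` is an assumption, and instead of physically omitting the pivotal row's column the sketch keeps full-width rows and simply moves on to the next column.

```python
from fractions import Fraction

def sweep_out(rows):
    """Return the pivotal rows found by the sweep-out method.

    Each pivotal row is kept at full width, so entries to the left of
    its leading 1 are zero.
    """
    work = [[Fraction(v) for v in r] for r in rows]
    n = len(work[0]) if work else 0
    pivots = []
    col = 0
    while work and col < n:
        # Step (i): find a row with a non-zero element in this column.
        idx = next((i for i, r in enumerate(work) if r[col] != 0), None)
        if idx is None:
            col += 1            # column is all zero: omit it
            continue
        p = work.pop(idx)
        p = [v / p[col] for v in p]          # leading element becomes 1
        # Step (ii): sweep the column out of every remaining row.
        work = [[a - r[col] * b for a, b in zip(r, p)] for r in work]
        pivots.append(p)
        col += 1                # step (iii): continue on the reduced matrix
    return pivots

matrix = [
    [0, 1, 2, 3],
    [2, -1, 5, 4],
    [4, 0, 6, 1],
    [0, -2, 4, 7],
]
pivots = sweep_out(matrix)
rank = len(pivots)
print(rank)
for p in pivots:
    print([str(v) for v in p])
```

Applied to the four rows of the example, the sketch picks the same pivotal rows in the same order as the table.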


1a.4 The Orthogonal Vector Space and the Deficiency Matrix

The set of vectors orthogonal to all the vectors in a matrix generates
an orthogonal vector space since any linear combination in it satisfies
the orthogonality condition. The basis of the orthogonal vector space
can be found in a simple way by an extension of the sweep-out process 
described above. Consider the basis of the vector space in the above 
example and sweep out the third and second columns using the third 
and second rows, yielding 


   1   0   0   -35/16
   0   1   0    -1/4
   0   0   1    13/8


which forms another basis for the same vector space. In general, the 
sweep-out method gives a reduced matrix (after a suitable rearrange- 
ment of columns if necessary to bring the unit elements to the diagonal) 
of the form 


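$$S: \quad \begin{matrix}
1 & 0 & \cdots & 0 & b_{11} & \cdots & b_{1k} \\
0 & 1 & \cdots & 0 & b_{21} & \cdots & b_{2k} \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1 & b_{r1} & \cdots & b_{rk}
\end{matrix}$$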


All these vectors are independent. Consider the set of independent 
vectors 


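$$D: \quad \begin{matrix}
b_{11} & b_{21} & \cdots & b_{r1} & -1 & 0 & \cdots & 0 \\
b_{12} & b_{22} & \cdots & b_{r2} & 0 & -1 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
b_{1k} & b_{2k} & \cdots & b_{rk} & 0 & 0 & \cdots & -1
\end{matrix}$$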

which is obtained by a rearrangement of the elements in S. The vectors 
in D are mutually orthogonal to those in S. Since (r + k) is the total 
number of independent vectors possible, D contains all vectors orthog- 
onal to S and for this reason is termed the deficiency matrix. In the 
above example, the deficiency matrix consists of the single row 


$$D: \quad (-35/16, \; -1/4, \; 13/8, \; -1)$$
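
The construction of the deficiency matrix can be sketched in Python as below. The function names `reduced_basis` and `deficiency_matrix` are assumptions made for illustration. Rather than rearranging columns, the sketch records which columns carry the unit elements and places the $-1$ entries in the remaining columns, which amounts to the same arrangement.

```python
from fractions import Fraction

def reduced_basis(rows):
    """Sweep out every pivotal column from all rows (form S of the text)."""
    work = [[Fraction(v) for v in r] for r in rows]
    n = len(work[0])
    basis, pivot_cols = [], []
    for col in range(n):
        idx = next((i for i, r in enumerate(work) if r[col] != 0), None)
        if idx is None:
            continue
        p = work.pop(idx)
        p = [v / p[col] for v in p]
        # Sweep the column out of the remaining rows and of earlier pivots.
        work = [[a - r[col] * b for a, b in zip(r, p)] for r in work]
        basis = [[a - r[col] * b for a, b in zip(r, p)] for r in basis]
        basis.append(p)
        pivot_cols.append(col)
    return basis, pivot_cols

def deficiency_matrix(rows):
    """Return (S, D): the reduced basis and a basis of the orthogonal space."""
    basis, pivot_cols = reduced_basis(rows)
    n = len(rows[0])
    free_cols = [c for c in range(n) if c not in pivot_cols]
    D = []
    for f in free_cols:
        d = [Fraction(0)] * n
        for row, p in zip(basis, pivot_cols):
            d[p] = row[f]           # the b elements, transposed
        d[f] = Fraction(-1)
        D.append(d)
    return basis, D

matrix = [[0, 1, 2, 3], [2, -1, 5, 4], [4, 0, 6, 1], [0, -2, 4, 7]]
S, D = deficiency_matrix(matrix)
print([str(v) for v in D[0]])

# Every original row should be orthogonal to every row of D.
products = [sum(a * b for a, b in zip(row, d)) for row in matrix for d in D]
print(products)
```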

Example 1. In a matrix the number of independent row vectors is 
equal to the number of independent column vectors. 

Example 2. If each vector consists of n elements, then there cannot 
be more than n independent vectors in a set. 

Example 3. Any vector of n elements can be expressed as a linear 
combination of any given set of vectors with rank n. 


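Example 4. The rank of the matrix

$$\begin{pmatrix}
1 - \dfrac{1}{n} & -\dfrac{1}{n} & \cdots & -\dfrac{1}{n} \\
-\dfrac{1}{n} & 1 - \dfrac{1}{n} & \cdots & -\dfrac{1}{n} \\
\vdots & \vdots & & \vdots \\
-\dfrac{1}{n} & -\dfrac{1}{n} & \cdots & 1 - \dfrac{1}{n}
\end{pmatrix}$$

containing $n$ rows and $n$ columns, is $(n - 1)$.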

1a.5 Linear Equations

The $m$ equations in $n$ unknowns $x_1, \cdots, x_n$

$$\begin{aligned}
a_{11}x_1 + \cdots + a_{1n}x_n &= 0 \\
\cdots \quad & \\
a_{m1}x_1 + \cdots + a_{mn}x_n &= 0
\end{aligned}$$

are called homogeneous linear equations. Since any solution considered
as a vector is orthogonal to every vector of the matrix

$$M: \quad \begin{matrix}
a_{11} & \cdots & a_{1n} \\
\vdots & & \vdots \\
a_{m1} & \cdots & a_{mn}
\end{matrix}$$

the totality of the solution vectors forms a space orthogonal to $M$.
The basis of this orthogonal space is the deficiency matrix $D$ to be
determined as in the previous section. If the rank of $M$ is $r$, then the
solution space has a basis consisting of $(n - r)$ independent vectors.

Replacing the zeros in the above equations by $b_1, \cdots, b_m$, we obtain
a set of non-homogeneous equations which can be regarded as homogeneous
in $(n + 1)$ unknowns.

$$\begin{aligned}
a_{11}x_1 + \cdots + a_{1n}x_n - b_1x_{n+1} &= 0 \\
\cdots \quad & \\
a_{m1}x_1 + \cdots + a_{mn}x_n - b_mx_{n+1} &= 0
\end{aligned}$$

Only those solutions for which $x_{n+1} \neq 0$ will yield solutions to the
non-homogeneous equations. If $c = (c_1, \cdots, c_m)$ is any vector orthogonal
to column vectors in $M$, then, multiplying the above equations by
$c_1, \cdots, c_m$ and adding, we obtain

$$x_{n+1}(b_1c_1 + \cdots + b_mc_m) = 0$$

If $\sum b_ic_i \neq 0$ for at least one $c$, then $x_{n+1} = 0$. On the other hand,
if $(b_1, \cdots, b_m)$ is dependent on the column vectors in $M$, then $\sum b_ic_i = 0$
for all $c$ orthogonal to $M$, in which case the number of independent
column vectors in $N$, obtained by adding the column vector with
elements $b_1, b_2, \cdots, b_m$ to $M$, remains the same as before. This means
that the homogeneous equations in $(n + 1)$ unknowns have $(n + 1 - r)$
independent solutions. Of these, $(n - r)$ independent solutions are the
vectors in $D$, the deficiency matrix of $M$, with zeros added to form
the $(n + 1)$th elements. The one more independent solution must
necessarily have a non-zero value of $x_{n+1}$, for otherwise it leads to a
contradiction. Hence the necessary and sufficient condition for the
non-homogeneous equations to have a solution is that $(b_1, \cdots, b_m)$ is
dependent on the column vectors or that the ranks of $N$ and $M$ are
equal.


Example 1. Find the value of $\delta$ for which the equations in three
unknowns

$$\begin{aligned}
2x_1 - x_2 + 5x_3 &= 4 \\
4x_1 + 6x_3 &= 1 \\
-2x_2 + 4x_3 &= 7 + \delta
\end{aligned}$$

admit a solution.
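
The rank condition of this section can be tried on the example with a short Python sketch. It is offered as an illustration only; the names `rank` and `has_solution` are assumptions, and the rank is found by the sweep-out idea of 1a.3. Since the third row of coefficients equals twice the first minus the second, one would expect the right-hand sides to satisfy $7 + \delta = 2 \times 4 - 1$, and the sketch is consistent with $\delta = 0$.

```python
from fractions import Fraction

def rank(rows):
    """Number of pivotal rows found by sweeping out column after column."""
    work = [[Fraction(v) for v in r] for r in rows]
    n_cols = len(work[0])
    r = 0
    for col in range(n_cols):
        idx = next((i for i, row in enumerate(work) if row[col] != 0), None)
        if idx is None:
            continue
        p = work.pop(idx)
        work = [[a - row[col] / p[col] * b for a, b in zip(row, p)]
                for row in work]
        r += 1
    return r

# Coefficient matrix M of the example.
M = [[2, -1, 5], [4, 0, 6], [0, -2, 4]]

def has_solution(delta):
    """True when the augmented matrix N has the same rank as M."""
    b = [4, 1, 7 + delta]
    N = [row + [bi] for row, bi in zip(M, b)]
    return rank(N) == rank(M)

print(rank(M))
print(has_solution(0), has_solution(1))
```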


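1b Theory of Matrices and Determinants

1b.1 Matrices

A matrix is, in general, an arrangement of $pq$ elements in $p$ rows and
$q$ columns. If $A$ and $B$ are two matrices of the form $(p, q)$

$$A = \begin{pmatrix} a_{11} & \cdots & a_{1q} \\ \vdots & & \vdots \\ a_{p1} & \cdots & a_{pq} \end{pmatrix}
\quad \text{and} \quad
B = \begin{pmatrix} b_{11} & \cdots & b_{1q} \\ \vdots & & \vdots \\ b_{p1} & \cdots & b_{pq} \end{pmatrix}$$

then matrix addition is defined by

$$A + B = \begin{pmatrix} a_{11} + b_{11} & \cdots & a_{1q} + b_{1q} \\ \vdots & & \vdots \\ a_{p1} + b_{p1} & \cdots & a_{pq} + b_{pq} \end{pmatrix}$$

When there is no ambiguity about the number of rows and columns,
it may be convenient to denote the matrices $A$ and $B$ by $(a_{ij})$ and
$(b_{ij})$, in which case $A + B = (a_{ij} + b_{ij})$. This process is known as
matrix addition. Just as in the case of vectors, a matrix can be multiplied
by a scalar $c$, the law of multiplication being

$$cA = \begin{pmatrix} ca_{11} & \cdots & ca_{1q} \\ \vdots & & \vdots \\ ca_{p1} & \cdots & ca_{pq} \end{pmatrix} = (ca_{ij})$$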


Consider two matrices $A$ and $B$ such that the number of columns in
the first is equal to the number of rows in the second. For instance,

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\quad \text{and} \quad
B = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{pmatrix}$$

are two such matrices. The product $AB$ of $A$ and $B$ is defined to be
a matrix whose $(i, j)$th element is the product of the $i$th row vector of
$A$ and the $j$th column vector of $B$.

$$AB = \begin{pmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} & a_{11}b_{13} + a_{12}b_{23} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} & a_{21}b_{13} + a_{22}b_{23}
\end{pmatrix}$$

In general, two matrices $A$ and $B$ can be multiplied in the above manner
only when the number of columns in $A$ is equal to the number of rows
in $B$. The resulting matrix has the same number of rows as in $A$ and
the same number of columns as in $B$.

The product $AB$ is not, in general, equal to the product $BA$. When
the product $AB$ is considered, $A$ is said to be post-multiplied by $B$
or $B$ pre-multiplied by $A$.

Matrix multiplication is associative, so that the product of three matrices
$A$, $B$, $C$ can be formed in any of the following ways:

$$ABC = A(BC) = (AB)C$$

For multiplication to be compatible the matrices $A$, $B$, $C$ should be of
the form $(p, q)$, $(q, r)$, $(r, s)$, in which case the triple product is of the
form $(p, s)$. Observe the rule:

$$(p, q)(q, r)(r, s) = (p, s)$$

The matrix $A'$ obtained by interchanging the rows and columns of
$A$ is called the transpose of $A$. From the definition it follows that

$$(AB)' = B'A', \quad (ABC)' = C'B'A', \quad \text{etc.}$$

If $x$ is a row vector, then $x'$ will be a column vector, in which case the
vector multiplication $x \cdot y$ of two vectors $x$ and $y$ can also be written
$xy'$. Both representations will be used throughout.
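
The multiplication rules stated so far can be checked on small matrices. The Python sketch below is illustrative only: `matmul` and `transpose` are hypothetical helper names, and the matrices are arbitrary examples of the forms (2, 2), (2, 3), and (3, 1).

```python
def matmul(A, B):
    """Product AB: element (i, j) is row i of A times column j of B."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    """Interchange the rows and columns of A."""
    return [list(col) for col in zip(*A)]

A = [[1, 2], [3, 4]]            # form (2, 2)
B = [[0, 1, 2], [1, 0, 3]]      # form (2, 3)
C = [[1], [0], [2]]             # form (3, 1)

AB = matmul(A, B)
print(AB)

# Associativity: (AB)C and A(BC) agree and are of the form (2, 1).
print(matmul(AB, C), matmul(A, matmul(B, C)))

# Transposition reverses the order of the factors.
print(transpose(AB) == matmul(transpose(B), transpose(A)))

# AB and BA differ in general.
P = [[1, 2], [3, 4]]
Q = [[0, 1], [1, 0]]
print(matmul(P, Q), matmul(Q, P))
```

Calling `matmul(B, A)` in this sketch raises an error, since a matrix of the form (2, 3) cannot be post-multiplied by one of the form (2, 2).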

A matrix that contains all zero elements is called a null matrix. It 
is easy to verify that the addition of a null matrix leaves any matrix 
unaltered whereas multiplication by a null matrix reduces any other 
matrix to a null matrix. 

A matrix with equal number of rows and columns having unity for 
all its diagonal elements and zero elsewhere is called a unit matrix 
represented by I. It is easy to verify that 


AI=A and IA=A 


provided that multiplication is permissible. The distributive law holds 
for matrix multiplication. 


A(B+C)=AB+ AC 
(A + B)(C + D) = AC + AD + BC+ BD 


A matrix of the form (n, n) is said to be a square matrix. A square 
matrix is said to be symmetric if the elements in the ith row, jth column 
and the jth row, ith column are equal. 

The rank of a matrix, as defined in 1a.3, is the number of independent
rows it contains. This is also equal to the number of independent
columns (example 1, 1a.4). The method of sweep out discussed in
1a.3 is very convenient for determining the rank of a matrix. The
following examples concerning the rank of a matrix will be useful.

Example 1. If $A$ is a square symmetric matrix of the form $(n, n)$,
such that

$$A(I - A) = 0$$

where $I$ is the unit matrix, then rank $A$ + rank $(I - A) = n$.

The condition $A(I - A) = 0$ implies that vectors in $A$, being orthogonal
to the vectors in $(I - A)$, are independent of the vectors in $(I - A)$.
If $r$ and $s$ are the numbers of independent vectors or the ranks of $A$
and $(I - A)$, then the number of independent vectors in $A$ and $(I - A)$
put together is $(r + s)$. If every row in $(I - A)$ is replaced by the
sum of the corresponding rows in $(I - A)$ and $A$, then $n$ rows of the form

$$\begin{matrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1
\end{matrix}$$



are obtained. These being independent, it follows that the number of 
independent vectors in A and (I — A) is not less than n. Since there 
are only n elements in a vector, this is the maximum possible number 
of independent vectors. Hence (r + s) = n. 
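
A small numerical instance of Example 1 may help. In the Python sketch below, which is an illustration and not part of the original text, $A$ is taken to be the 3 by 3 matrix with every element equal to 1/3 (an assumed example), so that $I - A$ is the matrix of example 4, 1a.4, with $n = 3$. The helper names `rank` and `matmul` are hypothetical.

```python
from fractions import Fraction

def rank(rows):
    """Number of pivotal rows found by the sweep-out method."""
    work = [[Fraction(v) for v in r] for r in rows]
    n_cols = len(work[0])
    r = 0
    for col in range(n_cols):
        idx = next((i for i, row in enumerate(work) if row[col] != 0), None)
        if idx is None:
            continue
        p = work.pop(idx)
        work = [[a - row[col] / p[col] * b for a, b in zip(row, p)]
                for row in work]
        r += 1
    return r

def matmul(A, B):
    """Product AB of two compatible matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

n = 3
A = [[Fraction(1, n)] * n for _ in range(n)]           # symmetric, A*A = A
I = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
I_minus_A = [[I[i][j] - A[i][j] for j in range(n)] for i in range(n)]

product = matmul(A, I_minus_A)
is_null = all(v == 0 for row in product for v in row)
print(is_null, rank(A), rank(I_minus_A))
```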

Example 2. The rank of the product $AB$ is not greater than the rank
of $A$ or the rank of $B$.

The product $AB$ can be obtained from $B$ by suitable linear combinations
of rows. This process does not increase the number of independent
rows in $B$. Therefore rank $AB \le$ rank $B$. Also $(AB)' = B'A'$,
so that the above argument leads to the result rank $(AB)' \le$
rank $A'$, which means that rank $AB \le$ rank $A$.

Example 3. Let $\alpha_1, \ldots, \alpha_r$ be a set of r vectors generating a vector
space. If $\beta_1, \ldots, \beta_s$ are s independent vectors, then the maximum
number of independent vectors belonging to the $\beta$ space and lying
entirely in the $\alpha$ space is equal to t, where (s - t) is the rank of the
matrix $(\delta_i \cdot \beta_j)$, $(i = 1, 2, \ldots;\ j = 1, \ldots, s)$, and $\delta_1, \delta_2, \ldots$ are the
independent vectors generating the vector space orthogonal to the $\alpha$ space.
In other words, there are t independent linear functions of $\beta$ which can
be expressed as linear functions of $\alpha$ only.

Consider any linear function

$$\gamma = l_1\beta_1 + \cdots + l_s\beta_s$$

and express the condition that it is orthogonal to $\delta_1, \delta_2, \ldots$.

$$l_1\,\delta_i \cdot \beta_1 + \cdots + l_s\,\delta_i \cdot \beta_s = 0 \qquad i = 1, 2, \ldots$$

The number of independent solutions is evidently s minus the rank
of $(\delta_i \cdot \beta_j)$. Each solution supplies a linear combination of $\beta$ lying in
the $\alpha$ space. If $\gamma_1, \ldots, \gamma_t$ are these vectors, then there are (s - t)
more vectors belonging to the $\beta$ space, $\gamma_{t+1}, \ldots, \gamma_s$, such that no linear
combination of $\gamma_{t+1}, \ldots, \gamma_s$ belongs to the $\alpha$ space, for otherwise a
contradiction is obtained.

Example 4. If the row vectors of a square matrix are all mutually
orthogonal and the square of each row vector is unity, so are its column
vectors.


1b.2 Partitioned Matrices 

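Sometimes it is convenient to represent a matrix obtained by the
juxtaposition of two or more matrices in a partitioned form. Thus a
partitioned matrix A is represented by

$$A = \begin{pmatrix} P & Q \\ R & S \end{pmatrix}$$

where the rows in P equal in number those in Q, the columns in P
equal those in R, and so on. By definition,

$$A' = \begin{pmatrix} P' & R' \\ Q' & S' \end{pmatrix}$$

If

$$B = \begin{pmatrix} E & F \\ G & H \end{pmatrix}$$

then

$$AB = \begin{pmatrix} P & Q \\ R & S \end{pmatrix}\begin{pmatrix} E & F \\ G & H \end{pmatrix} = \begin{pmatrix} PE + QG & PF + QH \\ RE + SG & RF + SH \end{pmatrix}$$

provided that the products PE, etc., are permissible.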


1b.3 Determinants 


A determinant is a real valued function of the elements of a square
matrix. If $x_1, x_2, \ldots, x_p$ denote the row vectors, then the function
may be represented by $D(x_1, \ldots, x_p)$. We shall choose the function
to satisfy the following conditions:

(a) $D(x_1, \ldots, cx_m, \ldots, x_p) = cD(x_1, \ldots, x_m, \ldots, x_p)$, where c is a
scalar.

(b) $D(x_1, \ldots, x_m + x_k, \ldots, x_p) = D(x_1, \ldots, x_m, \ldots, x_p)$, for $m \ne k$.

(c) $D(e_1, \ldots, e_p) = 1$, where $e_1, \ldots, e_p$ constitute the vectors of
a unit matrix and are called elementary vectors.


Let D exist. Then the following properties hold:

(1) If $x_i = 0$, then $x_i = 0 \cdot x_i$; hence, by (a), D = 0, putting c = 0.

(2) $D(x_1, \ldots, x_m + cx_k, \ldots, x_p) = D(x_1, \ldots, x_m, \ldots, x_p)$.
This is true when c = 0. If c is not zero, then

$$D(x_1, \ldots, x_m + cx_k, \ldots, x_p) = \frac{1}{c} D(x_1, \ldots, x_m + cx_k, \ldots, cx_k, \ldots, x_p) \quad \text{by (a)}$$

$$= \frac{1}{c} D(x_1, \ldots, x_m, \ldots, cx_k, \ldots, x_p) \quad \text{by (b)}$$

$$= D(x_1, \ldots, x_p) \quad \text{by (a)}$$

(3) From (2) it follows that, if the rows are dependent, then D = 0.
As a particular case, if two rows are identical the determinant vanishes.


(4) If two rows are interchanged, then D changes sign.

$$D = D(x_1, \ldots, x_m + x_k, \ldots, x_k, \ldots, x_p)$$

$$= D(x_1, \ldots, x_m + x_k, \ldots, x_k - x_m - x_k, \ldots, x_p)$$

$$= D(x_1, \ldots, x_m + x_k, \ldots, -x_m, \ldots, x_p)$$

$$= D(x_1, \ldots, x_k, \ldots, -x_m, \ldots, x_p)$$

$$= -D(x_1, \ldots, x_k, \ldots, x_m, \ldots, x_p)$$

In general, an even permutation of rows does not alter the determinant
whereas an odd permutation changes its sign.


(5)

$$D(x_1, \ldots, x_m + y, \ldots, x_p) = D(x_1, \ldots, x_m, \ldots, x_p) + D(x_1, \ldots, y, \ldots, x_p)$$

If $x_m$ depends on the rest of x, the result is established by subtracting
a suitable linear combination of other vectors from the mth vector in
each of the above determinants. If the other vectors are themselves
dependent, then each term above is zero. The only alternative is that
all x are independent, in which case y is necessarily a linear function of
x (because there cannot be more than p independent vectors).

$$y = \sum c_i x_i$$

$$D(x_1, \ldots, x_m + \sum c_i x_i, \ldots, x_p) = (1 + c_m) D(x_1, \ldots, x_m, \ldots, x_p)$$

$$= D(x_1, \ldots, x_m, \ldots, x_p) + D(x_1, \ldots, c_m x_m, \ldots, x_p)$$

$$= D(x_1, \ldots, x_m, \ldots, x_p) + D(x_1, \ldots, \sum c_i x_i, \ldots, x_p)$$


(6) $D(x_1 + y_1, \ldots, x_p + y_p) = \sum D(z_1, \ldots, z_p)$, where $z_i$ can be $x_i$
or $y_i$ and the summation is over $2^p$ possible sets, $z_1, \ldots, z_p$. This
follows by repeated application of (5).

(7) Any vector $x_m = a_{m1}e_1 + a_{m2}e_2 + \cdots + a_{mp}e_p$, where $e_1, \ldots, e_p$
are elementary vectors. Therefore

$$D(x_1, \ldots, x_p) = D(\sum a_{1i}e_i, x_2, \ldots, x_p)$$

$$= \sum a_{1i} D(e_i, x_2, \ldots, x_p)$$

$$= \sum\sum a_{1i}a_{2j} D(e_i, e_j, \ldots, x_p)$$

$$= \sum a_{1i}a_{2j} \cdots a_{pk} D(e_i, e_j, \ldots, e_k)$$


In the final summation,

$$D(e_i, e_j, \ldots, e_k) = 0 \quad \text{whenever two suffixes are equal}$$

$$= +1 \quad \text{when } i, j, \ldots \text{ is an even permutation of } 1, 2, \ldots, p$$

$$= -1 \quad \text{when } i, j, \ldots \text{ is an odd permutation}$$

Hence

$$D = \sum \pm a_{1i}a_{2j} \cdots a_{pk}$$

The function so derived satisfies the conditions (a), (b), and (c) given
above and is called the value of the determinant of the square matrix
$(a_{ij})$ and is also denoted by $|a_{ij}|$ or simply | A |.

(8)

$$D = \sum_i a_{mi} D(x_1, \ldots, e_i, \ldots, x_p) = \sum_i a_{mi}A_{mi}$$

where $A_{mi}$ is the determinant obtained by replacing the mth vector
by $e_i$ and is called the cofactor of $a_{mi}$. The minor of $a_{mi}$ is defined as
the determinant obtained by omitting the mth row and ith column.
The cofactor is obtained from the minor by the relation

$$A_{mi} = (-1)^{m+i} \times \text{the minor of } a_{mi}$$

(9) It is easy to verify that

$$\sum_i a_{mi}A_{si} = 0 \quad \text{if } m \ne s$$

because this is the value of a determinant with the mth and sth rows
identical.


From the definition it follows that, when the rank of the matrix A
of the type (n, n) is less than n, then | A | = 0. To prove the converse,
it is necessary to recall the sweep-out method described in 1a.3. When
a column is swept out the only operation that changes the value of the
determinant is the division by a non-zero element, also called the
pivotal element, to obtain the pivotal row. The pivotal row may be
moved to the first position if necessary by an interchange of rows, in
which case the determinant changes sign. Expanding by the first
column, it is seen that the determinant of this altered matrix is the same
as the determinant of the reduced matrix of one order less obtained by the
omission of the pivotal row and the swept-out column. This means
that | A | is, apart from a sign, equal to the determinant of the reduced
matrix at any stage multiplied by the product of the pivotal elements
used up to that stage. If the rank of A is less than n, a zero row will
be encountered at some stage, leading to a null value of the determinant
of the reduced matrix. If A has full rank, then the sweep-out process can
be carried out to the last row, giving a non-zero value to | A |. Hence,

(i) If | A | = 0, the rank of A must be less than n.

(ii) If | A | ≠ 0, the rank of A is n, or in other words the rows and
columns of A are all independent.
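
The argument above can be turned into a small computation. The following is a sketch only, assuming a hypothetical helper named `det` and exact rational arithmetic. Instead of dividing the pivotal row by the pivotal element it keeps a running product of the pivotal elements and changes sign at each row interchange, which is the same bookkeeping as in the text.

```python
from fractions import Fraction

def det(rows):
    """Determinant by pivotal condensation (sweep-out), exact arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n = len(m)
    value = Fraction(1)  # running product of pivotal elements, with sign
    for col in range(n):
        pivot = next((i for i in range(col, n) if m[i][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot available: rank is less than n
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            value = -value  # an interchange of rows changes the sign
        value *= m[col][col]
        for i in range(col + 1, n):
            f = m[i][col] / m[col][col]
            m[i] = [a - f * b for a, b in zip(m[i], m[col])]
    return value

print(det([[2, 0, 1], [1, 3, 2], [1, 1, 4]]))
print(det([[1, 2], [2, 4]]))
```

The second matrix has dependent rows, so the function returns zero for it, in line with (i) and (ii).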

Example 1. If | A |, the determinant of A, is not zero, in which case
A is called a non-singular matrix, then there exists a matrix represented
by $A^{-1}$ such that

$$AA^{-1} = A^{-1}A = I$$

Defining the matrix

$$A^{-1} = (a^{ij})$$

where $a^{ij} = A_{ji}/|A|$, $A_{ji}$ being the cofactor of $a_{ji}$, it is easy to verify
that the above result is true. The matrix $A^{-1}$ is called the reciprocal
of A and is defined only when | A | ≠ 0.
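
The cofactor definition of the reciprocal can be tried on a small matrix. The sketch below is an illustration under stated assumptions: the helper names `minor`, `det`, `reciprocal`, and `matmul` are hypothetical, and the determinant is expanded by the first row, which is only practical for small orders.

```python
from fractions import Fraction

def minor(m, i, j):
    """Matrix left after omitting row i and column j."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]

def det(m):
    """Determinant by expansion along the first row."""
    if not m:
        return Fraction(1)  # empty determinant, needed for order-1 matrices
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))

def reciprocal(a):
    """Element (i, j) is the cofactor of a_ji divided by |A|."""
    a = [[Fraction(v) for v in row] for row in a]
    n = len(a)
    d = det(a)
    if d == 0:
        raise ValueError("the reciprocal is defined only when |A| is not zero")
    return [[(-1) ** (i + j) * det(minor(a, j, i)) / d for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

A = [[2, 1], [1, 3]]
print(reciprocal(A))
print(matmul(A, reciprocal(A)))
```

Multiplying A by the computed reciprocal on either side returns the unit matrix.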

Example 2. If X is an unknown matrix involved in the equation
XA = Y, then $X = YA^{-1}$, provided that $A^{-1}$ exists.

Example 3. If B is a matrix of the form (m, n) and A of the form
(n, n) with rank n, then

Rank B = Rank BA

Let s be the rank of B and r that of BA. Since B is the product of
BA and $A^{-1}$, s is not greater than r. This in conjunction with the
earlier result (example 2, 1b.1) yields s = r.

Example 4. If the rank of a matrix A is r, then all subdeterminants
of the order (r + 1) or greater vanish. This is true because there are 
not more than r independent rows or columns. Conversely, if all deter- 
minants of the order (r + 1) or greater vanish and at least one deter- 
minant of the order r does not vanish, then the rank of A is r. 

Example 5. The rank of a matrix A is unaltered by pre- or post-multiplication
by an elementary matrix $E_{rs}(\lambda)$ where $E_{rs}(\lambda)$ is defined
by $(e_{ij})$,

$$e_{ii} = 1 \quad \text{for all } i$$

$$e_{ij} = \lambda \quad \text{for } i = r \text{ and } j = s$$

$$e_{ij} = 0 \quad \text{for other values of } i \text{ and } j$$

The proposition is true because $|E_{rs}(\lambda)| \ne 0$. Pre-multiplication by
$E_{rs}(\lambda)$ means replacing the rth row of A by its rth row + $\lambda$ times the
sth row. Post-multiplication means replacing the sth column of A by
the sth column + $\lambda$ times the rth column.



Example 6. If A is a square symmetric matrix, then there exists a
non-singular matrix B such that the matrix BAB' is in the diagonal
form.

If there is a non-zero diagonal element in A, then it may be used as
a pivot and the row and column in which it occurs can be swept out,
leaving a reduced matrix. This method consists in only row and column
additions or, in other words, pre- and post-multiplications by elementary
matrices. Since the matrix is symmetrical, the symmetrical elements in 
a row and column can be swept out by pre- and post-multiplying by 
elementary matrices which are only transposes. Thus, sweeping out a 
row and column is equivalent to pre- and post-multiplying by products 
of elementary matrices which are transposes. The reduced matrix is 
also symmetrical. The above process can be carried on whenever a 
non-zero diagonal element can be found. If a non-zero diagonal element 
cannot be found, then, by the addition of a row and the corresponding 
column a non-zero element can be brought to the diagonal position and 
the above process continued. This is also a symmetrical operation by 
the use of elementary matrices, so that it follows that the matrix A 
can be reduced to the diagonal form by pre- and post-multiplying by 
B and B’ where B is a product of elementary matrices. 

It can be easily seen that any non-symmetrical square matrix A can 
be reduced to the diagonal form by pre- and post-multiplications by 
matrices which need not be transposes. 

Example 7. If the product AB of two square matrices is zero, then 
either A = 0, or B = 0, or both A and B are singular matrices. 


Example 8. If A and B are square matrices of order n and ranks
r and s, then

Rank AB ≥ r + s - n

From example 6 there exist two non-singular matrices C and D such that

$$CAD = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}$$

where I is the unit matrix (r, r).

Rank of AB = Rank of CAB, since C is non-singular

$$= \text{Rank of } CADD^{-1}B$$

$$= \text{Rank of } \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} B_1$$

where $B_1 = D^{-1}B$ and hence has rank s. The last product is a matrix
obtained by choosing the first r rows of $B_1$ and the rest consisting of
zero rows. Therefore the rank of AB is equal to the rank of the first
r rows of $B_1$. If this is equal to t, then the number of dependent rows
is (r - t). By considering all rows of $B_1$ we get (n - s) dependent
rows, which must not be less than the dependent vectors in a subset.
Hence

r - t ≤ n - s

or

t ≥ r + s - n
Example 9. | AB | = | A | | B | where A and B are two square matrices.

There exist matrices E and F which are products of elementary
matrices such that

EAF = D

where D is diagonal with elements $d_1, \ldots, d_p$ so that $|A| = d_1 \cdots d_p$.

$$AB = AFF^{-1}B$$

$$|AB| = |EAFF^{-1}B|$$

Since the determinant is not altered on multiplication by elementary
matrices,

$$|AB| = |DF^{-1}B| = d_1 d_2 \cdots d_p |F^{-1}B| = d_1 \cdots d_p |B| = |A||B|$$

Example 10. If A is a matrix of the type (m, n), then

$$|AA'| \ge 0 \quad \text{if } m < n$$

$$= 0 \quad \text{if } m > n$$

If m < n, there exists * a matrix B of the type (n - m, n) containing
row vectors which are orthogonal to those in A (i.e., AB' = 0) and
satisfying the condition BB' = I. Consider the product

$$\begin{pmatrix} B \\ A \end{pmatrix}\begin{pmatrix} B' & A' \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & AA' \end{pmatrix}$$

Taking determinants,

$$\begin{vmatrix} B \\ A \end{vmatrix}^2 = |AA'| \ge 0$$

* Consider the equations xA' = 0. This has at least (n - m) independent solution
vectors which may be replaced by an equivalent set of standardized orthogonal
vectors. These vectors form the matrix B.

If m > n, then

$$\begin{pmatrix} A & 0 \end{pmatrix}\begin{pmatrix} A' \\ 0 \end{pmatrix} = AA' + 0 = AA'$$

where 0 stands for a null matrix. Taking determinants,

$$|AA'| = |A \ \ 0|^2 = 0$$


Example 11. If A is a matrix of the type (m, n), m < n, then:

(i) | AA' | = the sum of squares of all possible m columned determinants
in A.

(ii) Rank of AA' = Rank of A.

Let $A = (a_{ij})$, in which case

$$AA' = \left(\sum_r a_{ir}a_{jr}\right)$$

$$|AA'| = \sum \begin{vmatrix} a_{1r}a_{1r} & a_{2s}a_{1s} & \cdots & a_{mt}a_{1t} \\ \vdots & \vdots & & \vdots \\ a_{1r}a_{mr} & a_{2s}a_{ms} & \cdots & a_{mt}a_{mt} \end{vmatrix}$$

summed over all $n^m$ sets of $(r, s, \ldots, t)$; $(r, s, \ldots, t = 1, \ldots, n)$.
In this summation it is easy to see that the determinants in which any
two of the symbols $(r, s, \ldots, t)$ are equal vanish, so that the summation
is over sets in which $r \ne s \ne \cdots \ne t$. Corresponding to any set
$(r, s, \ldots, t)$ there are m! permutations, the determinants arising out of which
can be grouped into a single determinant. Thus

$$|AA'| = \sum \begin{vmatrix} a_{1r}a_{1r} + \cdots + a_{1t}a_{1t} & \cdots & a_{1r}a_{mr} + \cdots + a_{1t}a_{mt} \\ \vdots & & \vdots \\ a_{mr}a_{1r} + \cdots + a_{mt}a_{1t} & \cdots & a_{mr}a_{mr} + \cdots + a_{mt}a_{mt} \end{vmatrix}$$

summed over the ${}^{n}C_{m}$ combinations $(r, s, \ldots, t)$ from 1 to n,

$$= \sum \begin{vmatrix} a_{1r} & a_{1s} & \cdots & a_{1t} \\ a_{2r} & a_{2s} & \cdots & a_{2t} \\ \vdots & \vdots & & \vdots \\ a_{mr} & a_{ms} & \cdots & a_{mt} \end{vmatrix}^2$$

which proves result (i).

To prove (ii), let the rank of A be r so that there are r independent
vectors which may be marked. In the product AA', the marked rows
and the corresponding columns in A' give rise to a determinant of order
r which, by the above proposition, is equal to the sum of squares of
all possible r columned determinants chosen out of the r marked rows.
Since the r rows in A are independent, there is at least one determinant
containing r independent columns in it, so that the rth-order determinant
obtained from the marked rows and columns in A and A' is not zero.
Therefore, the rank of AA' ≥ r. But the rank of the product AA'
cannot exceed the rank of A, which is r. Therefore, the rank of AA'
= the rank of A.
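
Result (i) of example 11 is easy to check on a small matrix. The following sketch is illustrative only; the function names `det` and `gram_det_two_ways` and the test matrix are assumptions. It computes | AA' | directly and, separately, the sum of squares of all m columned determinants of A.

```python
from itertools import combinations

def det(m):
    """Determinant by expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def gram_det_two_ways(a):
    """Return (|AA'|, sum of squared m-columned determinants of A)."""
    m, n = len(a), len(a[0])
    aat = [[sum(a[i][k] * a[j][k] for k in range(n)) for j in range(m)]
           for i in range(m)]
    squares = sum(det([[row[c] for c in cols] for row in a]) ** 2
                  for cols in combinations(range(n), m))
    return det(aat), squares

print(gram_det_two_ways([[1, 2, 0], [0, 1, 3]]))
```

For the 2 by 3 matrix used here the three two-columned determinants are 1, 3, and 6, and both numbers returned are 46.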

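Example 12. Defining for any $x_1, \ldots, x_n$

$$S_j = (x_1 - \bar{x})^j + (x_2 - \bar{x})^j + \cdots + (x_n - \bar{x})^j = d_1^j + d_2^j + \cdots + d_n^j$$

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

$$d_i = (x_i - \bar{x}) \qquad i = 1, 2, \ldots, n$$

show that the determinant

$$\begin{vmatrix} S_0 & S_1 & \cdots & S_i \\ S_1 & S_2 & \cdots & S_{i+1} \\ \vdots & \vdots & & \vdots \\ S_i & S_{i+1} & \cdots & S_{2i} \end{vmatrix}$$

is not less than zero for all i.

This follows from the fact that the above determinant is | AA' | ≥ 0
(example 10 above), where

$$A = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ d_1 & d_2 & \cdots & d_n \\ \vdots & \vdots & & \vdots \\ d_1^i & d_2^i & \cdots & d_n^i \end{pmatrix}$$

This proves the consistency relations to be satisfied by the moments
calculated from any sample of observations.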
Example 13. The jth moment of a number of variables is defined
by $\mu_j = S_j/n$, where $S_j$ is as in example 12. Two constants $\beta_1$ and $\beta_2$
defined by K. Pearson are

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}$$

To show that $\beta_2 \ge 1 + \beta_1$ consider the determinant of example 12 for
i = 2.

$$\begin{vmatrix} n & 0 & n\mu_2 \\ 0 & n\mu_2 & n\mu_3 \\ n\mu_2 & n\mu_3 & n\mu_4 \end{vmatrix} \ge 0$$

Expanding,

$$n^3(\mu_2\mu_4 - \mu_3^2 - \mu_2^3) \ge 0$$

Hence the result.
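
The inequality of example 13 can be verified on any sample. The sketch below is only an illustration; the function name `pearson_betas` and the sample values are assumptions, and integer observations are used so that the moments are exact fractions.

```python
from fractions import Fraction

def pearson_betas(xs):
    """Return (beta_1, beta_2) computed from the moments about the mean."""
    n = len(xs)
    mean = Fraction(sum(xs), n)

    def mu(j):
        # jth moment about the mean, mu_j = S_j / n
        return sum((x - mean) ** j for x in xs) / n

    m2, m3, m4 = mu(2), mu(3), mu(4)
    return m3 ** 2 / m2 ** 3, m4 / m2 ** 2

b1, b2 = pearson_betas([1, 2, 4, 7])
print(b1, b2, b2 >= 1 + b1)
```

A sample taking only two distinct values with equal frequency, such as (0, 1), gives $\beta_1 = 0$ and $\beta_2 = 1$, so the bound is attained there.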
1c Quadratic Forms 
1c.1 Definitions 
The general quadratic form in n variables $x_1, \ldots, x_n$ is

$$a_{11}x_1^2 + a_{12}x_1x_2 + \cdots + a_{1n}x_1x_n$$

$$+\ a_{21}x_2x_1 + a_{22}x_2^2 + \cdots + a_{2n}x_2x_n$$

$$+\ \cdots$$

$$+\ a_{n1}x_nx_1 + a_{n2}x_nx_2 + \cdots + a_{nn}x_n^2$$

where $a_{ij} = a_{ji}$. Adopting the matrix notation, the above quadratic
form can be written

xAx'

where x is the vector $(x_1, \ldots, x_n)$ and A is the symmetric matrix $(a_{ij})$.
The matrix A is called the matrix of the quadratic form xAx’, and 
| A |, its discriminant. 


The rank of the quadratic form xAx’ is the same as the rank of the 
matrix A. 


1c.2 Linear Transformations 


Let the variables in x be transformed to those in $y = (y_1, \ldots, y_n)$
by means of the transformation

x = yC

(i) Under this transformation the quadratic form xAx' changes to
yCAC'y' so that the matrix of the new form is CAC'. The discriminant
of the transformed quadratic form is $|C|^2 |A|$.

(ii) It has been shown in example 6 of 1b.3 that there exists a matrix
B, | B | ≠ 0, such that the matrix BAB' is in the diagonal form. If
this matrix B is chosen as the matrix of transformation,
then the quadratic form can be reduced to the form
$c_1y_1^2 + \cdots + c_ny_n^2$ containing the square terms only. The value of | B | = 1 since
B is a product of elementary matrices which have a unit determinant.

By making a further transformation $\sqrt{|c_i|}\, y_i = z_i$ the quadratic
form becomes $\pm z_1^2 \pm z_2^2 \pm \cdots$.

(iii) If the rank of the matrix A is r, then the reduced quadratic
form contains only r square terms. This follows, for the ranks of A
and BAB', where | B | ≠ 0 (example 3, 1b.3), are the same.



(iv) A linear transformation x = yC is said to be orthogonal if CC'
= I. The transformation is non-singular since $|C|^2 = 1$. The quadratic
form $x_1^2 + \cdots + x_n^2 = xIx'$ changes over to

$$yCIC'y' = yCC'y' = yIy' = y_1^2 + y_2^2 + \cdots + y_n^2$$

This is referred to as the invariance property of the distance function
under an orthogonal transformation. Also let $x_1$ and $x_2$ be two vectors
transforming to $y_1$ and $y_2$. Then

$$x_1x_2' = y_1CC'y_2' = y_1y_2'$$

so that the angles are also invariant. Also, if x transforms to y and
a to b, then

$$(x - a)^2 = (y - b)^2$$

and

$$(x_1 - a_1)(x_2 - a_2)' = (y_1 - b_1)(y_2 - b_2)'$$


1c.3 Classification of Quadratic Forms 

The real quadratic form xAx' is said to be definite if it is positive
(or negative) for every set of real values $x_1, \ldots, x_n$ other than the set
$x_1 = \cdots = x_n = 0$. A quadratic form which is never negative but which
assumes zero value for some non-null values of $x_1, x_2, \ldots, x_n$ is called
semi-positive definite. Similarly, semi-negative forms can be defined.

(i) The definiteness of a quadratic form is invariant under non-singular
transformations.

Since the transformation x = yB is non-singular, there exists the inverse
transformation $y = xB^{-1}$, which establishes a one-to-one correspondence.
If the quadratic form is positive (or negative) for a given
vector x, then the transformed form is positive (or negative) for the
corresponding y, and vice versa. Also $y = (0, \ldots, 0)$ when and only
when $x = (0, \ldots, 0)$.

(ii) Every real positive definite quadratic form can be transformed
by a real transformation matrix of unit modulus to the form

$$c_{11}y_1^2 + \cdots + c_{nn}y_n^2$$

where each $c_{ii} > 0$.

It is shown in (ii) of 1c.2 that every quadratic form can be reduced
to a form consisting of the square terms only. No coefficient is negative,
for otherwise it implies that the quadratic form is negative for some
values of y ≠ 0 and hence of x. Also, no coefficient can be zero, for,
if $c_{ii} = 0$, then the quadratic form vanishes for $y_i \ne 0$ and others equal
to zero, which is contrary to the assumption of the definiteness of the
quadratic form.


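(iii) The necessary and sufficient condition that a real quadratic form
xAx' is positive definite is that

$$a_{11} > 0, \quad \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} > 0, \quad \ldots, \quad \begin{vmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{vmatrix} > 0$$

Let the positive definite quadratic form under the transformation
x = yB be reduced to $c_1y_1^2 + \cdots + c_ny_n^2$, where $c_1, \ldots, c_n$ are positive.
In such a case

$$BAB' = \begin{pmatrix} c_1 & 0 & \cdots & 0 \\ 0 & c_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & c_n \end{pmatrix}$$

$$|B|^2 |A| = c_1c_2 \cdots c_n > 0$$

Therefore | A | > 0. Consider the set of values of $x_1, \ldots, x_n$ in which
$x_n = 0$. Then, from the above argument, it can be shown that

$$\begin{vmatrix} a_{11} & \cdots & a_{1(n-1)} \\ \vdots & & \vdots \\ a_{(n-1)1} & \cdots & a_{(n-1)(n-1)} \end{vmatrix} > 0$$

and so on, which establishes the necessity of the condition.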


To prove sufficiency, let

$$A_i = \begin{vmatrix} a_{11} & \cdots & a_{1i} \\ \vdots & & \vdots \\ a_{i1} & \cdots & a_{ii} \end{vmatrix}$$

Since $a_{11} > 0$, the first column and row in A can be swept out. The
resulting matrix is

$$\begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & & \vdots \\ 0 & b_{n2} & \cdots & b_{nn} \end{pmatrix}$$

where $a_{11}b_{22} = A_2$, for the value of any subdeterminant including the
first row and column is unaltered. Also $A_2 > 0$. Hence it follows that
$b_{22} = A_2/a_{11} > 0$. With $b_{22}$ as a pivot the second row and column can
be swept out. The resulting matrix is

$$\begin{pmatrix} a_{11} & 0 & 0 & \cdots & 0 \\ 0 & b_{22} & 0 & \cdots & 0 \\ 0 & 0 & c_{33} & \cdots & c_{3n} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & c_{n3} & \cdots & c_{nn} \end{pmatrix}$$

where $a_{11}b_{22}c_{33} = A_3$. Therefore $c_{33} = A_3/A_2 > 0$, and so on. Finally
the matrix A can be reduced to the diagonal form

$$\begin{pmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2/A_1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & A_n/A_{n-1} \end{pmatrix}$$

where each diagonal element is positive. This shows that the quadratic
form xAx' can be transformed to the positive definite form

$$A_1y_1^2 + \frac{A_2}{A_1}y_2^2 + \cdots + \frac{A_n}{A_{n-1}}y_n^2$$

which establishes the sufficiency of the condition. It should be noted
that sweeping out is equivalent to pre- and post-multiplication by a
product of elementary matrices which are transposes. This product
of the elementary matrices provides the transformation matrix.
(iv) The necessary and sufficient condition that a real quadratic form
xAx' is negative definite is that

$$A_1 < 0, \quad A_2 > 0, \quad A_3 < 0, \quad \ldots$$

This is true, since -xAx' is positive definite.
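
Conditions (iii) and (iv) translate directly into a test on the determinants $A_1, \ldots, A_n$. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the matrices are small symmetric examples chosen for the check, and the determinant is expanded by the first row, which is only sensible for low orders.

```python
def det(m):
    """Determinant by expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def leading_minors(a):
    """The determinants A_1, ..., A_n of the leading square blocks of a."""
    return [det([row[:k] for row in a[:k]]) for k in range(1, len(a) + 1)]

def is_positive_definite(a):
    return all(d > 0 for d in leading_minors(a))

def is_negative_definite(a):
    # signs must alternate: A_1 < 0, A_2 > 0, A_3 < 0, ...
    return all((-1) ** k * d > 0
               for k, d in enumerate(leading_minors(a), start=1))

A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
print(leading_minors(A), is_positive_definite(A))
```

For this A the determinants are 2, 3, 4, so the form is positive definite, and changing the sign of every element gives -2, 3, -4, the pattern required in (iv).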
1c.4 The Latent Roots of a Matrix and the Characteristic Vectors 
Let xAx' be a quadratic form in n variables. Let us find a vector x
which maximizes xAx' subject to the condition xx' = 1. This is obtained
by differentiating *

$$xAx' - \lambda(xx' - 1)$$

where $\lambda$ is a Lagrangian multiplier. The equations are

$$xA - \lambda xI = 0$$

$$xx' = 1$$

In order that a non-null x may exist, $\lambda$ must be chosen to satisfy the
determinantal equation

$$|A - \lambda I| = 0$$

This is called the characteristic equation of the matrix A. Any value of
$\lambda$ which satisfies this equation is called a latent root, and the x corresponding
to a given $\lambda$ is called a characteristic vector.

* The following rules of differentiation with respect to vectors (i.e., simultaneously
with respect to all the variables) will be useful.
Since $xx' = x_1^2 + x_2^2 + \cdots + x_n^2$,

$$\left(\frac{\partial}{\partial x_1}, \frac{\partial}{\partial x_2}, \ldots, \frac{\partial}{\partial x_n}\right)(x_1^2 + \cdots + x_n^2) = 2(x_1, \ldots, x_n)$$

Therefore

$$\frac{\partial}{\partial x} xx' = 2x$$

Similarly

$$\frac{\partial}{\partial x} xAx' = 2xA$$

(i) The degree of the characteristic equation for roots other than
$\lambda = 0$ is equal to the rank of A.

This can be verified by expanding the determinantal equation. The
coefficients of $\lambda^{n-r-1}, \ldots, \lambda, \lambda^0$, being the sums of determinants
containing more than r columns and rows, will be zero if the rank of A is r.

(ii) If A is real, all the roots are real.

Let x + iy be the characteristic vector corresponding to a complex
root $\lambda$. Then

$$(x + iy)A - \lambda(x + iy) = 0$$

Multiplying by (x - iy)',

$$\lambda(x^2 + y^2) = (x + iy)A(x - iy)'$$

$$= xAx' + yAy' + i(yAx' - xAy')$$

$$= xAx' + yAy'$$

Hence $\lambda$ is real.

(iii) The value of the quadratic form for a given characteristic vector
x is equal to the value of the latent root $\lambda$ associated with it.
Since

$$xA - \lambda xI = 0$$

then

$$xAx' = \lambda xx' = \lambda$$

The maximum and minimum values of xAx' subject to the condition
xx' = 1 are then the largest and the least latent roots.

(iv) If the quadratic form is positive definite, then all the latent roots
are positive. This is true, since the latent roots are the values of the
quadratic form for some values of the variables.

(v) The characteristic vectors corresponding to two different latent
roots are orthogonal.

Let $\lambda_1, \lambda_2$ be two roots and x, y the corresponding vectors. Then

$$xA - \lambda_1 x = 0$$

$$yA - \lambda_2 y = 0$$

Multiplying the first by y', the second by x', and subtracting,

$$(\lambda_1 - \lambda_2)xy' = xAy' - yAx' = 0$$

From this it follows that xy' = 0, since $\lambda_1 \ne \lambda_2$.
(vi) There exists an orthogonal transformation which transforms a
quadratic form xAx' into $y\Lambda_n y'$, where $\Lambda_n$ is a diagonal matrix containing
all the latent roots of A.

Assume that $X_i$ is a matrix of the form (i, n), the rows of which are
the characteristic vectors corresponding to the latent roots $\lambda_1, \ldots, \lambda_i$,
not all of which need be different. Also, let the rows of $X_i$ be orthogonal.
We will show that under these conditions there exists a vector x which
is orthogonal to the rows in $X_i$ and is a characteristic vector corresponding
to the latent root $\lambda_{i+1}$.

Since the row vectors are orthogonal and normalized

$$X_iX_i' = I$$

Since the jth vector satisfies the relation

$$x_jA - \lambda_jx_j = 0$$

it follows that

$$X_iA - \Lambda_iX_i = 0$$

where $\Lambda_i$ is a diagonal matrix containing $\lambda_1, \ldots, \lambda_i$ in the diagonal.
Let us find a vector x, (xx' = 1), orthogonal to the rows in $X_i$ (i.e.,
$X_ix' = 0$) and which maximizes xAx'. Introducing the vector $\mu =
(\mu_1, \ldots, \mu_i)$ of Lagrangian multipliers, the quantity to be differentiated is

$$xAx' - 2\mu X_ix' - \lambda xx'$$

Differentiating, we obtain

$$2xA - 2\mu X_i - 2\lambda xI = 0 \qquad (1c.4.1)$$

Eliminating x and $\mu$, the equation giving $\lambda$ is

$$\begin{vmatrix} A - \lambda I & X_i' \\ X_i & 0 \end{vmatrix} = 0 \qquad (1c.4.2)$$

where the blocks $A - \lambda I$, $X_i'$, and $X_i$ are of the forms (n, n), (n, i), and (i, n).



Pre-multiplying this by

$$\begin{vmatrix} I & 0 \\ -X_i & \Lambda_i - \lambda I \end{vmatrix}$$

where the blocks are of the forms (n, n), (n, i), (i, n), and (i, i), we get

$$\begin{vmatrix} A - \lambda I & X_i' \\ 0 & -I \end{vmatrix} = \pm|A - \lambda I| = 0$$

Equation (1c.4.2) can be written

$$\frac{|A - \lambda I|}{(\lambda_1 - \lambda) \cdots (\lambda_i - \lambda)} = 0$$

Hence all $\lambda$ satisfying (1c.4.2) are the latent roots of $|A - \lambda I| = 0$
other than those already considered. Let us consider the latent root
$\lambda_{i+1}$ and solve for x and $\mu$ from the set of equations (1c.4.1). Representing
the solution for x by $x_{i+1}$, we have

$$x_{i+1}A - \mu X_i - \lambda_{i+1}x_{i+1} = 0$$

$$X_ix_{i+1}' = 0$$

Or, multiplying by $X_i'$,

$$x_{i+1}AX_i' - \mu X_iX_i' - \lambda_{i+1}x_{i+1}X_i' = 0$$

i.e.,

$$x_{i+1}X_i'\Lambda_i - \mu I - \lambda_{i+1}x_{i+1}X_i' = 0$$

i.e., since $x_{i+1}X_i' = 0$,

$$\mu = 0$$

which shows that $x_{i+1}$ satisfies the equations

$$x_{i+1}A - \lambda_{i+1}x_{i+1} = 0$$

$$X_ix_{i+1}' = 0$$

Therefore $x_{i+1}$ is a characteristic vector corresponding to $\lambda_{i+1}$ and is
orthogonal to $X_i$.

Starting from the first characteristic vector, all the n can be constructed
so that there exists an orthogonal transformation $X_n$ such that

$$X_nA = \Lambda_nX_n$$

and hence

$$X_nAX_n' = \Lambda_n$$


Corollary. If $\lambda_i$ is a root of multiplicity r, then there are r and only
r orthogonal vectors satisfying the equation

$$x(A - \lambda_iI) = 0$$

so that the rank of $A - \lambda_iI$ is (n - r).

Corresponding to a latent root $\lambda_i$ of multiplicity $r_i$ there are, by the
above result, $r_i$ orthogonal characteristic vectors. Since $(r_1 + r_2 + \cdots)$
= n and there can be only n orthogonal vectors, it follows that there are
only $r_i$ characteristic vectors corresponding to $\lambda_i$. Hence the rank of
$(A - \lambda_iI)$ is $(n - r_i)$.

To obtain the characteristic vectors corresponding to the root $\lambda_i$ of
multiplicity $r_i$ the best method is to find the space orthogonal to
$(A - \lambda_iI)$ and choose any set of orthogonal vectors in this space.


1c.5 Pairs of Quadratic Forms 
Let A and B be two symmetric matrices and x a vector such that

$$x(A - \lambda B) = 0$$

where $\lambda$ satisfies the determinantal equation $|A - \lambda B| = 0$. If | B |
≠ 0, then $B^{-1}$ exists, in which case x satisfies the equation

$$x(AB^{-1} - \lambda I) = 0 \qquad (1c.5.1)$$

and $\lambda$ is the latent root of the matrix $AB^{-1}$. The determination of the
vectors satisfying (1c.5.1) thus reduces to the case considered in the
previous section.
(i) If $\lambda_1$ and $\lambda_2$ are two different roots, then

$$x_1AB^{-1} - \lambda_1x_1 = 0$$

or

$$x_1A - \lambda_1x_1B = 0$$

$$x_2A - \lambda_2x_2B = 0$$

Multiplying the first equation by $x_2'$ and the second by $x_1'$ and subtracting,

$$(\lambda_1 - \lambda_2)x_1Bx_2' = 0 \quad \text{or} \quad x_1Bx_2' = 0$$

(ii) If $\lambda$ is a root of multiplicity r then the rank of $(AB^{-1} - \lambda I)$
can be shown to be (n - r) as in the previous section. The number of
independent vectors satisfying the equation $x(AB^{-1} - \lambda I) = 0$ is r.
The vectors in this set may be chosen such that any two vectors x and
y satisfy the relation xBy' = 0. Thus, corresponding to the n latent
roots we obtain n vectors which may be represented by a matrix $X_n$
satisfying the condition that

$$X_nBX_n' = C \quad \text{a diagonal matrix}$$

Let the leading diagonal of C contain the elements $c_1, c_2, \ldots, c_n$.


(iii) If $\Lambda_n$ denotes the diagonal matrix containing $\lambda_1, \ldots, \lambda_n$ in the
diagonal, then

$$X_nAB^{-1} - \Lambda_nX_n = 0$$

or

$$X_nA - \Lambda_nX_nB = 0$$

Multiplying by $X_n'$,

$$X_nAX_n' = \Lambda_nX_nBX_n' = \Lambda_nC$$

This shows that the transformation $x = yX_n$ transforms the quadratic
forms xAx' and xBx' into

$$c_1\lambda_1y_1^2 + \cdots + c_n\lambda_ny_n^2$$

and

$$c_1y_1^2 + \cdots + c_ny_n^2$$

(iv) If B is positive definite, then the transformation $\sqrt{c_i}\, y_i = Y_i$
carries the above quadratic forms to

$$\lambda_1Y_1^2 + \cdots + \lambda_nY_n^2$$

and

$$Y_1^2 + \cdots + Y_n^2$$

(v) If the quadratic form xAx' is never negative, then no $\lambda$ is negative.
Example 1. If A and B are symmetric matrices such that B is positive
definite and (A - B) is positive or semi-positive definite, then | A |
≥ | B |.

Consider the equation

$$|A - B - \lambda B| = 0$$

where no root is negative since (A - B) is positive or semi-positive
definite and B is positive definite. Since

$$|A - B(1 + \lambda)| = 0$$

it follows that $|A|/|B| = (1 + \lambda_1)(1 + \lambda_2) \cdots (1 + \lambda_n)$, where $\lambda_1, \lambda_2, \ldots,
\lambda_n$ are the roots. None of the factors in the product $(1 + \lambda_1) \cdots
(1 + \lambda_n)$ is less than one so that

$$\frac{|A|}{|B|} \ge 1$$

which proves the result. This result also shows that the matrix A is
also positive definite.

Example 2. The rank of the quadratic form $\sum(x_i - \bar{x})^2$, where $\bar{x}$
is the average of $x_1, x_2, \ldots, x_n$, is (n - 1). (Use example 4, 1a.4.)

Example 3. Consider the matrix $(x_{ij})$ of measurements, $i = 1, 2,
\ldots, p$; $j = 1, 2, \ldots, n$. If

$$\sum_{j=1}^{n} x_{ij} = n\bar{x}_i$$

$$S_{tu} = \sum_j (x_{tj} - \bar{x}_t)(x_{uj} - \bar{x}_u)$$

then the matrix $(S_{tu})$, $(t, u = 1, 2, \ldots, p)$, is

(i) positive definite or semi-definite if n ≥ p;

(ii) positive semi-definite if n < p; and

(iii) positive definite if the rank of $(x_{ij} - \bar{x}_i)$ is equal to p and
semi-definite if it is less than p.

(Use examples 10 and 11, 1b.3.)

Example 4. The solution of

$$|d_id_j - \lambda a_{ij}| = 0$$

is

$$\lambda = \sum\sum a^{ij}d_id_j = dA^{-1}d'$$

where $A = (a_{ij})$ and $d = (d_1, \ldots, d_p)$.

Example 5. If A and C are two matrices such that with respect to a
symmetric definite matrix $\Lambda$

$$A\Lambda C' = 0$$

then the rows in A are independent of the rows in C.

If any vector x in A is dependent on the vectors in C, then $x\Lambda x' = 0$,
which is impossible since $\Lambda$ is a definite matrix.

Example 6. Consider the quadratic form in $y_1, \ldots, y_k$

$$\sum_{i=1}^{n} (y_1 + d_iy_2 + d_i^2y_3 + \cdots + d_i^{k-1}y_k)^2$$

where $d_1, \ldots, d_n$ are as defined in example 12 of 1b.3, and deduce the
result about the moments. When the variable x is continuous, the
summation is replaced by integration. We thus deduce the consistency
relations to be satisfied by the moments of any distribution.


1c.6 Reduction of an Asymmetric Matrix

In population studies it is often necessary to determine the powers
of an asymmetric matrix called the generation matrix. If A represents
the generation matrix of the type (k, k) and $f_0$, $f_n$ represent the initial
and the nth generation frequencies in some well-defined classes, then

$$f_n = f_0A^n$$

The calculations can be simplified by the following steps. Let $\lambda_1, \ldots,
\lambda_p$ be the distinct roots of the determinantal equation

$$|A - \lambda I| = 0$$

with the multiplicity of the root $\lambda_i$ being equal to $m_i$, $\sum m_i = k$. Let it
be possible to find $m_i$ independent solutions $x_{i1}, \ldots, x_{im_i}$ of the equation

$$x_i(A' - \lambda_iI) = 0 \qquad (1c.6.1)$$
If P stands for the matrix containing the vectors $x_{ij}$, then

$$PA' = \Lambda P$$

where $\Lambda$ is the matrix containing the latent roots allowing repetitions
and arranged in the same order as the corresponding vectors. If $x_1,
\ldots, x_p$ are any characteristic vectors corresponding to the distinct
roots $\lambda_1, \ldots, \lambda_p$, then they are all independent. If there were a relation

$$\rho_1x_1 + \cdots + \rho_px_p = 0 \qquad (1c.6.2)$$

where the terms with zero coefficients are omitted, then from the relations
$x_iA' = \lambda_ix_i$ it follows that

$$\lambda_1\rho_1x_1 + \cdots + \lambda_p\rho_px_p = 0 \qquad (1c.6.3)$$

Using (1c.6.3), one variable in (1c.6.2) is eliminated and a new relation
obtained. This gives rise to a relation similar to (1c.6.3). After repeated
eliminations the relation (1c.6.2) reduces to an absurd result
that one of the vectors is zero. Therefore no such relation as (1c.6.2)
is true. From this it follows that P is a non-singular matrix in which
case

$$A = (P^{-1}\Lambda P)' = Q^{-1}\Lambda Q$$

where $P' = Q^{-1}$. Hence

$$A^2 = Q^{-1}\Lambda QQ^{-1}\Lambda Q = Q^{-1}\Lambda^2Q$$

and generally

$$A^n = Q^{-1}\Lambda^nQ$$

so that an easy rule is provided once the P ...
a simple representation breaks down ...
simple way from generation to generation. In fact

$$x_{ij}f_n' = \lambda_i^n x_{ij}f_0' \qquad (1c.6.4)$$

so that, knowing the initial values of these linear functions, the vector
$f_n$ can be solved from the equations (1c.6.4).
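
Relation (1c.6.4) can be checked on a small generation matrix. The sketch below is only an illustration: the 2 by 2 matrix A, the initial frequencies, and the helper names are assumptions, and the latent roots 1 and 2 with their vectors (1, 0) and (1, 1) were worked out by hand from x(A' - $\lambda$I) = 0 for this particular A.

```python
def step(f, a):
    """One generation: the row vector f post-multiplied by the matrix a."""
    return [sum(f[i] * a[i][j] for i in range(len(f)))
            for j in range(len(a[0]))]

def generation(f0, a, n):
    """Frequencies after n generations, f_n = f_0 A^n, by repeated steps."""
    f = list(f0)
    for _ in range(n):
        f = step(f, a)
    return f

def dot(x, f):
    return sum(p * q for p, q in zip(x, f))

# Hypothetical generation matrix and initial frequencies.
A = [[1, 1], [0, 2]]
f0 = [3, 5]
# (latent root, vector x satisfying xA' = root * x), found by hand for this A.
roots_and_vectors = [(1, [1, 0]), (2, [1, 1])]

n = 5
fn = generation(f0, A, n)
for root, x in roots_and_vectors:
    print(dot(x, fn), root ** n * dot(x, f0))
```

Each linear function of the frequencies is simply multiplied by the nth power of its latent root, so $f_n$ can be recovered from two such equations without forming $A^n$.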


1d Numerical Appendix


1d.1 The Evaluation of Determinants, Reciprocals, and 
Solutions of Equations 

‘The theoretical expression for the value of the determinant given in 
1b.3 is not convenient for practical computations. The method of 
pivotal condensation will be useful in (i) determining the value of a 
determinant, (ii) solving linear equations, and (iii) obtaining the recip- 
rocal of a matrix. These three techniques are simultaneously illustrated 
in the example below. Only the relevant computations need be retained 
in any problem. In this illustration & non-symmetrical matrix is 
chosen. In most of the statistical computations symmetric matrices 
are met with. This results in a certain amount of reduction in compu- 
tations, and the layout of a simplified procedure in such cases is dis- 


cussed in the text. 
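
The three uses of pivotal condensation named above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `pivotal_condensation`, the 3 x 3 matrix `A`, and the right-hand side `b` are hypothetical, exact fractions are used instead of rounded decimals, and each column is swept out completely (above and below the pivot) rather than by back-substitution. Every row carries the right-hand side, a unit matrix, and a check-sum column.

```python
from fractions import Fraction

def pivotal_condensation(matrix, rhs):
    """Return (determinant, solution, inverse) of a square system by pivoting.

    Each row is augmented with the right-hand side, a unit matrix, and a
    check-sum column holding the sum of all other entries in the row.
    """
    n = len(matrix)
    rows = []
    for i in range(n):
        body = [Fraction(v) for v in matrix[i]] + [Fraction(rhs[i])]
        body += [Fraction(int(i == j)) for j in range(n)]
        rows.append(body + [sum(body)])  # last entry is the sum check

    det = Fraction(1)
    for col in range(n):
        # Pick a row with a non-zero entry in this column as the pivotal row.
        pivot_row = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot_row is None:
            raise ValueError("matrix is singular")
        if pivot_row != col:
            rows[col], rows[pivot_row] = rows[pivot_row], rows[col]
            det = -det  # a row interchange changes the sign
        pivot = rows[col][col]
        det *= pivot  # the determinant is the product of the pivotal elements
        rows[col] = [v / pivot for v in rows[col]]  # reduce the pivot to unity
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
        # Sum check: in every derived row the last element is the sum of the rest.
        for row in rows:
            assert row[-1] == sum(row[:-1])

    solution = [row[n] for row in rows]
    inverse = [row[n + 1:2 * n + 1] for row in rows]
    return det, solution, inverse

# Hypothetical example system (not the one tabulated in the book).
A = [[2, 1, 1], [1, 3, 2], [1, 0, 0]]
b = [4, 5, 6]
det, x, A_inv = pivotal_condensation(A, b)
```

If only the determinant is wanted, the columns beyond the matrix itself could be dropped, which is the economy the text refers to when it says that only the relevant computations need be retained.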
References

Levi, F. W. (1942). Algebra, Calcutta University.
Turnbull, H. W., and A. C. Aitken (1932). An introduction to the theory of canonical matrices. Blackie, London.


TABLE 1d.1. Computations for the Evaluation of a Determinant, Reciprocal Matrix, etc.

[Table printed sideways; entries not legible here. Columns: 0, row number; 1 to 4, matrix of the linear equations; 5, right-hand side of the equations; 6 to 9, unit matrix leading to the reciprocal; 10, sum check.]

Notes on the Computations in Table 1d.1

(i) Row 10 is obtained from row 01 by reducing the first element to unity. This is called a pivotal row and the first element, 0.1228, underlined in the table, is the first pivotal element. By multiplying row 10 by 0.1281 and subtracting from row 02, row 11 is obtained, in which the first element is zero. Similarly, by making the first element in rows 03 and 04 zero, rows 12 and 13 are obtained. Starting with the reduced matrix in rows 11, 12, and 13, the above operation is repeated.

(ii) Rows 10, 20, 30, and 40 are pivotal rows, and the product of the four pivotal elements

0.12280 × 0.18719 × 0.21601 × 0.37510 = 0.001863

is the value of the determinant of the matrix of equations. If only the value of the determinant is needed, the calculations beyond column 4 should be ended.

(iii) Calculations up to column 5 give the solutions of non-homogeneous equations, and columns 6, 7, 8, and 9 are intended for the solutions of non-homogeneous equations whose right-hand elements are columns of a unit matrix. The four sets of solutions will form the elements of the reciprocal matrix.

(iv) The pivotal row 40 gives the solutions for $x_4$, and, by substituting the value of $x_4$ in row 30, the row giving the solutions for $x_3$ is obtained. By substituting the values of $x_3$ and $x_4$ in row 20, the next row is obtained, and so on. Thus the last four rows in the reverse order (in columns 6, 7, 8, and 9) give the reciprocal matrix.

(v) To start with, the matrix has only four significant figures, but it is better to keep more places (usually one more) in subsequent calculations to keep a check on the rounding-off errors.

(vi) A sum check provided in the last column ensures the accuracy of all the calculations. To start, the sum of the elements in each of the initial rows (first four) is written in the last column. For subsequent operations this is treated as an extra column. Any derived row has the property that the last element is the sum of the other elements. This may be checked whenever a new row is obtained, either by reducing the first coefficient to unity or by sweeping out a first column.


CHAPTER 2


Theory of Distributions 


2a Some Analytical Methods in Distribution Problems 


2a.1 Binomial Distribution 

Binomial distribution is the simplest problem in the theory of distri- 
bution. The observations consist of n independent stochastic variables, 
each of which can assume two values, 1 and 0, with probabilities p and 
q (p +q = 1). If we attach the value 1 to success and 0 to failure in 
a trial, then the sum of observations in n independent trials gives the 
total number of successes. What is the probability distribution of the 
number of successes in n trials? 

If we denote by $r$ the number of successes, then

$$r = x_1 + x_2 + \cdots + x_n$$

where $x_i = 1$ for success and 0 for failure. Since the events are independent, the probability for any series of $x$ with $r$ ones and $(n - r)$ zeros is $p^r q^{n-r}$. To obtain the probability of $r$ successes we have to sum up the probabilities corresponding to all possible series of $x$ containing 1, $r$ times and 0, $(n - r)$ times. The number of such series is $\binom{n}{r}$, the number of combinations of $r$ out of $n$ things. Since each such series has the probability $p^r q^{n-r}$, the total probability of $r$ is $\binom{n}{r} p^r q^{n-r}$. This is the term containing $p^r$ in the binomial expansion $(q + p)^n$.

If $n_1$ trials give $r_1$ successes and an independent set of $n_2$ trials gives $r_2$ successes, the probability of success being the same for both sets, the probability of $r = (r_1 + r_2)$ should be the same as that of $r$ successes in $n = (n_1 + n_2)$ trials. This can be formally derived in the following manner. The probability of $r_1$ and $r_2$ is

$$P(r_1, r_2) = \binom{n_1}{r_1} p^{r_1} q^{n_1 - r_1} \binom{n_2}{r_2} p^{r_2} q^{n_2 - r_2} = \binom{n_1}{r_1}\binom{n_2}{r_2} p^{r_1 + r_2} q^{n_1 + n_2 - r_1 - r_2}$$

Therefore

$$P(r = r_1 + r_2) = \sum_{r = r_1 + r_2} P(r_1, r_2) = p^r q^{n-r} \sum_{r = r_1 + r_2} \binom{n_1}{r_1}\binom{n_2}{r_2} = \binom{n}{r} p^r q^{n-r}$$

Thus the sum of two binomial variates is also a binomial variate.
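
The additive property can be checked numerically by convolving two binomial distributions with a common $p$ and comparing the result with the binomial for $n_1 + n_2$ trials. The following is a minimal sketch; the function names and the values of `n1`, `n2`, and `p` are illustrative choices, not taken from the text.

```python
from math import comb

def binomial_pmf(n, p):
    """Probabilities of r = 0, 1, ..., n successes in n trials."""
    q = 1.0 - p
    return [comb(n, r) * p**r * q**(n - r) for r in range(n + 1)]

def convolve(pmf_a, pmf_b):
    """Distribution of the sum of two independent non-negative integer variates."""
    out = [0.0] * (len(pmf_a) + len(pmf_b) - 1)
    for i, a in enumerate(pmf_a):
        for j, b in enumerate(pmf_b):
            out[i + j] += a * b  # accumulates P(r1 = i) * P(r2 = j) at r = i + j
    return out

# Illustrative values (assumed): the same p for both sets of trials.
n1, n2, p = 4, 6, 0.3
summed = convolve(binomial_pmf(n1, p), binomial_pmf(n2, p))
direct = binomial_pmf(n1 + n2, p)
max_gap = max(abs(s - d) for s, d in zip(summed, direct))
```

Here `max_gap` should differ from zero only by rounding error; with unequal proportions in the two sets the agreement would be lost.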
Corresponding to a probability distribution there is a distribution function which gives the probability of a variate's assuming a value less than or equal to an assigned value. If $P(x)$ denotes the probability distribution, the corresponding distribution function will be denoted by $F(x)$.

In the binomial case, for $r < n$,

$$F(r) = \sum_{s=0}^{r} P(s)$$

$$= \frac{n!}{r!\,(n-r-1)!} \sum_{s=0}^{r} \frac{r!}{s!\,(r-s)!}\, \frac{(n-r-1)!\,(r-s)!}{(n-s)!}\, q^{n-s} p^s$$

$$= \frac{n!}{r!\,(n-r-1)!} \int_0^1 \sum_{s=0}^{r} \binom{r}{s} t^{r-s} (1-t)^{n-r-1} q^{n-s} p^s \, dt$$

$$= \frac{n!}{r!\,(n-r-1)!}\, q^{n-r} \int_0^1 (1-t)^{n-r-1} (p + qt)^r \, dt$$

Putting $t = 1 - x/q$, the above expression becomes

$$\frac{n!}{r!\,(n-r-1)!} \int_0^q x^{n-r-1} (1-x)^r \, dx$$

This is an incomplete beta function which is extensively tabulated in Tables of Incomplete Beta Function (edited by K. Pearson). If $r$ is small, then each term in the above summation can be separately calculated and added to obtain $F(r)$.
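
The agreement between the direct summation and the incomplete beta integral can be illustrated numerically. The sketch below is an assumption-laden check rather than a substitute for the tables: it evaluates the integral by Simpson's rule, and the function names and the values of `n`, `p`, and `r` are hypothetical.

```python
from math import comb, factorial

def binomial_cdf(r, n, p):
    """F(r): probability of at most r successes, by direct summation."""
    q = 1.0 - p
    return sum(comb(n, s) * p**s * q**(n - s) for s in range(r + 1))

def binomial_cdf_beta(r, n, p, steps=2000):
    """F(r) from the incomplete beta integral over (0, q), by Simpson's rule.

    Requires r < n; steps must be even for Simpson's rule.
    """
    q = 1.0 - p
    h = q / steps

    def g(x):
        return x ** (n - r - 1) * (1.0 - x) ** r

    total = g(0.0) + g(q)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h)
    integral = total * h / 3.0
    return factorial(n) / (factorial(r) * factorial(n - r - 1)) * integral

# Illustrative values (assumed).
n, p, r = 10, 0.3, 3
by_sum = binomial_cdf(r, n, p)
by_beta = binomial_cdf_beta(r, n, p)
```

For small `r` the direct sum is the cheaper route, as the text remarks; the integral form matters when `r` is large or when tables are to be used.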
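Example 1. For the binomial distribution $(p + q)^n$,

$$E\left(\frac{r}{n}\right) = p \qquad V\left(\frac{r}{n}\right) = \frac{pq}{n} \qquad \operatorname{cov}\left(\frac{r}{n}, \frac{n-r}{n}\right) = -\frac{pq}{n}$$

where $E$ stands for expectation, $V$ for variance, and cov for covariance.

Example 2. If $\mu_t$ is the $t$th corrected moment of the observed number of successes, then

$$\mu_{t+1} = pq\left[nt\mu_{t-1} + \frac{d\mu_t}{dp}\right] \qquad \text{(Romanovsky, 1925)}$$

Hence

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(q-p)^2}{npq} \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} = 3 + \frac{1 - 6pq}{npq}$$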


2a.2 Multinomial Distribution 


If there are $k$ mutually exclusive events with probabilities $\pi_1, \pi_2, \cdots, \pi_k$, $(\Sigma\pi_i = 1)$, then the probability of occurrence of $n_1$ events of the first kind, $n_2$ of the second, $\cdots$, etc., in a total of $n$ independent trials is

$$P(n_1, n_2, \cdots, n_k) = \frac{n!}{n_1!\, n_2! \cdots n_k!}\, \pi_1^{n_1} \cdots \pi_k^{n_k}$$

The product $\pi_1^{n_1} \cdots \pi_k^{n_k}$ refers to the probability of events occurring in some order; $n!/n_1! \cdots n_k!$ represents the number of arrangements of $n_1$ things of one kind, $n_2$ of another kind, and so on. Therefore the total probability of the desired number of events of the various kinds is the product of these two expressions, the argument being similar to that used for the binomial. The above probability is a term of the multinomial expansion

$$(\pi_1 + \pi_2 + \cdots + \pi_k)^n$$


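Since

$$\sum \frac{n!}{n_1! \cdots n_k!}\, \pi_1^{n_1} \cdots \pi_k^{n_k} = (\pi_1 + \cdots + \pi_k)^n$$

it follows, on differentiating with respect to the $\pi_i$ treated as free variables, that

$$\pi_i \frac{\partial}{\partial \pi_i} \sum \frac{n!}{n_1! \cdots n_k!}\, \pi_1^{n_1} \cdots \pi_k^{n_k} = \sum n_i P(n_1, \cdots, n_k) = \pi_i \frac{\partial}{\partial \pi_i} (\pi_1 + \cdots + \pi_k)^n = n\pi_i (\pi_1 + \cdots + \pi_k)^{n-1}$$

$$\sum n_i n_j P(n_1, \cdots, n_k) = \pi_i \pi_j \frac{\partial^2}{\partial \pi_i\, \partial \pi_j} (\pi_1 + \cdots + \pi_k)^n = n(n-1)\pi_i \pi_j (\pi_1 + \cdots + \pi_k)^{n-2}$$

Similarly

$$\pi_i \frac{\partial}{\partial \pi_i} \sum n_i P(n_1, \cdots, n_k) = \sum n_i^2 P(n_1, \cdots, n_k) = n\pi_i (\pi_1 + \cdots + \pi_k)^{n-1} + n(n-1)\pi_i^2 (\pi_1 + \cdots + \pi_k)^{n-2}$$

Putting $\pi_1 + \cdots + \pi_k = 1$, these results give

$$E(n_i) = n\pi_i$$

$$E(n_i n_j) = n(n-1)\pi_i \pi_j$$

$$E(n_i^2) = n\pi_i + n(n-1)\pi_i^2$$

Therefore

$$V(n_i) = E(n_i^2) - [E(n_i)]^2 = n\pi_i(1 - \pi_i)$$

$$\operatorname{cov}(n_i, n_j) = E(n_i n_j) - E(n_i)E(n_j) = -n\pi_i \pi_j$$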


which are similar to the results obtained for the binomial. From these results the variances and covariances of linear functions of frequencies in $k$ classes can be obtained in a simple way.

$$V(l_1 n_1 + \cdots + l_k n_k) = \Sigma l_i^2 V(n_i) + 2\Sigma\Sigma\, l_i l_j \operatorname{cov}(n_i, n_j)$$

$$= \Sigma l_i^2 n\pi_i(1 - \pi_i) + 2\Sigma\Sigma\, l_i l_j(-n\pi_i \pi_j)$$

$$= n\{\Sigma l_i^2 \pi_i - (\Sigma l_i \pi_i)^2\}$$

Similarly

$$\operatorname{cov}\{(l_1 n_1 + \cdots + l_k n_k), (m_1 n_1 + \cdots + m_k n_k)\} = n\{\Sigma l_i m_i \pi_i - (\Sigma l_i \pi_i)(\Sigma m_i \pi_i)\}$$

Higher moments of the multinomial can be derived by extending the differentiation processes, but they are not of much use in practice.
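
The formula for the variance of a linear function of the class frequencies can be verified by brute-force enumeration for a small case. The sketch below assumes illustrative values for `n`, the class probabilities `probs`, and the coefficients `l`; the helper names are hypothetical.

```python
from itertools import product
from math import factorial

def multinomial_table(n, probs):
    """All frequency vectors (n_1, ..., n_k) summing to n, with probabilities."""
    k = len(probs)
    table = []
    for counts in product(range(n + 1), repeat=k):
        if sum(counts) != n:
            continue
        coef = factorial(n)
        prob = 1.0
        for c, pi in zip(counts, probs):
            coef //= factorial(c)   # builds n! / (n_1! ... n_k!) exactly
            prob *= pi ** c
        table.append((counts, coef * prob))
    return table

def variance_of_linear_function(n, probs, l):
    """Variance of l_1 n_1 + ... + l_k n_k by direct enumeration."""
    table = multinomial_table(n, probs)
    values = [(sum(li * c for li, c in zip(l, counts)), p) for counts, p in table]
    mean = sum(v * p for v, p in values)
    second = sum(v * v * p for v, p in values)
    return second - mean ** 2

# Illustrative values (assumed).
n, probs, l = 5, [0.2, 0.3, 0.5], [1.0, -2.0, 0.5]
enumerated = variance_of_linear_function(n, probs, l)
formula = n * (sum(li**2 * pi for li, pi in zip(l, probs))
               - sum(li * pi for li, pi in zip(l, probs)) ** 2)
```

Enumeration grows quickly with `n` and `k`, which is why the closed form is the practical tool; the check is only meant to make the algebra concrete.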


2a.3 The Poisson Distribution
This is a discrete distribution where the stochastic variable assumes values 0, 1, 2, $\cdots$, with the probability for $r$ equal to

$$e^{-\mu} \frac{\mu^r}{r!}$$

where $\mu$ is a parameter. Since

$$\sum_{r=0}^{\infty} \frac{\mu^r}{r!} = e^{\mu}$$

$$\Sigma r \frac{\mu^r}{r!} = \mu \frac{d}{d\mu} e^{\mu} = \mu e^{\mu}$$

$$\Sigma r^2 \frac{\mu^r}{r!} = \mu \frac{d}{d\mu} (\mu e^{\mu}) = \mu e^{\mu} + \mu^2 e^{\mu}$$

These give

$$E(r) = \mu \qquad V(r) = \mu$$

so that the mean and variance of this distribution are equal. The higher moments can be derived from the relation

$$\mu_{r+1} = r\mu\,\mu_{r-1} + \mu \frac{d\mu_r}{d\mu}$$

obtained in a manner similar to the corresponding relation in the binomial distribution. This gives

$$\mu_3 = \mu \qquad \mu_4 = \mu + 3\mu^2 \quad \text{etc.}$$

Hence

$$\beta_1 = \frac{1}{\mu} \qquad \beta_2 = 3 + \frac{1}{\mu}$$
If $r_1, r_2, \cdots, r_k$ are independent variates from the same Poisson distribution,

$$P(r_1, r_2, \cdots, r_k) = e^{-k\mu} \frac{\mu^{r_1 + r_2 + \cdots + r_k}}{r_1!\, r_2! \cdots r_k!}$$

$$P(r = r_1 + r_2 + \cdots + r_k) = e^{-k\mu} \mu^r \sum \frac{1}{r_1!\, r_2! \cdots r_k!} = e^{-k\mu} \frac{(k\mu)^r}{r!}$$

which shows that the sum of $k$ Poisson variates, each with parameter $\mu$, is itself a Poisson variate with the parameter $k\mu$. The conditional distribution

$$P(r_1, r_2, \cdots, r_k \mid r) = \frac{P(r_1, r_2, \cdots, r_k)}{P(r)} = \frac{r!}{r_1! \cdots r_k!} \left(\frac{1}{k}\right)^r$$

is a multinomial with index $r$ and probability in each class equal to $1/k$.


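If $r_1, \cdots, r_k$ are independent Poisson variates with parameters $\mu_1, \cdots, \mu_k$, then

$$P(r_1, \cdots, r_k) = e^{-(\mu_1 + \cdots + \mu_k)} \frac{\mu_1^{r_1} \cdots \mu_k^{r_k}}{r_1! \cdots r_k!}$$

$$P(r = r_1 + \cdots + r_k) = e^{-\mu} \sum \frac{\mu_1^{r_1} \cdots \mu_k^{r_k}}{r_1! \cdots r_k!} = e^{-\mu} \frac{\mu^r}{r!}$$

where $\mu = \mu_1 + \cdots + \mu_k$, which shows that the sum of $k$ independent Poisson variates is in general a Poisson variate. This is true in the case of a binomial variate only when the binomial proportions are the same.

The conditional distribution

$$P(r_1, \cdots, r_k \mid r) = \frac{P(r_1, \cdots, r_k)}{P(r)} = \frac{r!}{r_1! \cdots r_k!} \left(\frac{\mu_1}{\mu}\right)^{r_1} \cdots \left(\frac{\mu_k}{\mu}\right)^{r_k}$$

is multinomial with probabilities $\mu_1/\mu, \cdots, \mu_k/\mu$ in the $k$ classes.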


In general, any multinomial distribution can be written as the ratio of

$$P(n_1, \cdots, n_k) = e^{-n\pi_1} \frac{(n\pi_1)^{n_1}}{n_1!} \cdots e^{-n\pi_k} \frac{(n\pi_k)^{n_k}}{n_k!}$$

to

$$P(n = n_1 + \cdots + n_k) = e^{-n} \frac{n^n}{n!}$$

which is the relative probability of $k$ Poisson variates with parameters $\mu_i = n\pi_i$, $(i = 1, 2, \cdots, k)$, subject to the condition that the sum of the variables is equal to $n$.


The distribution function for the Poisson distribution is obtained below.

$$F(r) = e^{-\mu}\left(1 + \mu + \frac{\mu^2}{2!} + \cdots + \frac{\mu^r}{r!}\right)$$

$$= \frac{e^{-\mu}}{r!} \sum_{s=0}^{r} \frac{r!}{s!\,(r-s)!}\, \mu^s \int_0^{\infty} e^{-x} x^{r-s} \, dx$$

$$= \frac{1}{r!} \int_0^{\infty} e^{-(\mu + x)} (\mu + x)^r \, dx$$

$$= \frac{1}{r!} \int_{\mu}^{\infty} e^{-y} y^r \, dy$$

which is the incomplete gamma integral, tables for which have been edited by K. Pearson.

The Poisson probability can be deduced from the binomial when $n$, the number of trials, is large and $p$ is small. For instance, the probability for no success is

$$P(0) = q^n = (1 - p)^n = \left(1 - \frac{np}{n}\right)^n \to e^{-np} = e^{-\mu}$$

$$\frac{P(r+1)}{P(r)} = \frac{n - r}{r + 1} \cdot \frac{p}{q} \to \frac{\mu}{r + 1}$$

where $\mu = np$. The successive terms of the binomial then tend to

$$e^{-\mu}, \quad \frac{\mu e^{-\mu}}{1!}, \quad \frac{\mu^2 e^{-\mu}}{2!}, \quad \cdots$$

yielding the Poisson series. Thus, when the probability is small and the number of trials is indefinitely large, the Poisson distribution may be used.
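
How quickly the binomial terms approach the Poisson series can be seen numerically by holding $\mu = np$ fixed and increasing $n$. The sketch below is illustrative only; the value of `mu`, the sample sizes, and the function names are assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(r, n, p):
    """Probability of r successes in n trials."""
    return comb(n, r) * p**r * (1.0 - p) ** (n - r)

def poisson_pmf(r, mu):
    """Poisson probability of r with parameter mu."""
    return exp(-mu) * mu**r / factorial(r)

def largest_gap(n, mu, r_max=10):
    """Largest absolute difference between the two probabilities for r <= r_max."""
    p = mu / n  # keep np = mu fixed while n grows
    return max(abs(binomial_pmf(r, n, p) - poisson_pmf(r, mu))
               for r in range(r_max + 1))

# Illustrative values (assumed): mu fixed, n increasing.
mu = 2.0
gaps = [largest_gap(n, mu) for n in (20, 200, 2000)]
```

The entries of `gaps` should shrink roughly in proportion to `p`, which is the sense in which the approximation is good when the probability is small.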


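2a.4 Normal Distribution
This is a continuous distribution with the probability differential of the stochastic variable $x$ equal to

$$N(\mu, \sigma)\, dx = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}\, dx$$

The $r$th moment of this distribution about the mean is

$$\mu_r = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} (x - \mu)^r e^{-(x-\mu)^2/2\sigma^2}\, dx$$

$$= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} y^r e^{-y^2/2\sigma^2}\, dy \qquad \text{putting } y = x - \mu$$

$$= 0 \qquad \text{if } r \text{ is odd}$$

$$= \frac{1}{\sigma\sqrt{2\pi}} \int_0^{\infty} z^{(r-1)/2} e^{-z/2\sigma^2}\, dz \qquad \text{putting } z = y^2, \text{ if } r \text{ is even}$$

$$= \frac{(2\sigma^2)^{(r+1)/2}}{\sigma\sqrt{2\pi}}\, \Gamma\left(\frac{r+1}{2}\right)$$

$$= \frac{\sigma^r (r-1)!}{\left(\dfrac{r-2}{2}\right)!\; 2^{(r-2)/2}}$$

whence we have the following results:

$$\mu_2 = \sigma^2 \qquad \mu_3 = 0 \qquad \mu_4 = 3\sigma^4$$

$$\beta_1 = 0 \qquad \beta_2 = 3$$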
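
The closed form for the even moments about the mean, $\sigma^r (r-1)! / \{((r-2)/2)!\, 2^{(r-2)/2}\}$, can be compared with a direct numerical integration. This is a minimal sketch under stated assumptions: the value of `sigma`, the integration range, and the function names are illustrative, and the midpoint rule stands in for the analytical integral.

```python
from math import exp, factorial, pi, sqrt

def moment_formula(r, sigma):
    """Moment of order r about the mean for a normal variate (closed form)."""
    if r % 2:
        return 0.0  # odd moments vanish by symmetry
    half = (r - 2) // 2
    return sigma**r * factorial(r - 1) / (factorial(half) * 2**half)

def moment_numeric(r, sigma, width=12.0, steps=20000):
    """The same moment by the midpoint rule over (-width*sigma, width*sigma)."""
    h = 2.0 * width * sigma / steps
    total = 0.0
    for i in range(steps):
        y = -width * sigma + (i + 0.5) * h
        total += y**r * exp(-y * y / (2.0 * sigma**2))
    return total * h / (sigma * sqrt(2.0 * pi))

# Illustrative value of sigma (assumed).
sigma = 1.5
checks = {r: (moment_formula(r, sigma), moment_numeric(r, sigma))
          for r in (2, 3, 4, 6)}
beta2 = moment_formula(4, sigma) / moment_formula(2, sigma) ** 2
```

The ratio `beta2` does not depend on `sigma`, in line with the constant value 3 for the normal distribution.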
If $x$ and $y$ are two independent normal variates with mean values $m_1$ and $m_2$ and standard deviations $\sigma_1$ and $\sigma_2$, then their joint distribution is

$$\text{const. } e^{-\frac{1}{2}Q(x, y)}\, dx\, dy$$

where

$$Q(x, y) = \frac{(x - m_1)^2}{\sigma_1^2} + \frac{(y - m_2)^2}{\sigma_2^2}$$

$$= \frac{[(x - m_1) + (y - m_2)]^2}{\sigma_1^2 + \sigma_2^2} + \frac{[(x - m_1) - \lambda(y - m_2)]^2}{\sigma_1^2 + \lambda^2\sigma_2^2}$$

where $\sigma_1^2 = \lambda\sigma_2^2$. Make the transformation

$$u = x + y \qquad v = x - \lambda y$$

so that the joint distribution of $u$ and $v$ becomes

$$\text{const. } e^{-\frac{1}{2}Q(u, v)}\, du\, dv$$

where

$$Q(u, v) = \frac{(u - m_1 - m_2)^2}{\sigma_1^2 + \sigma_2^2} + \frac{(v - m_1 + \lambda m_2)^2}{\sigma_1^2 + \lambda^2\sigma_2^2}$$

The distributions of $u$ and $v$ are independent as their joint distribution turns out to be a product of two functions. The distribution of $u$ is

$$\text{const. } e^{-\frac{1}{2}(u - m_1 - m_2)^2/(\sigma_1^2 + \sigma_2^2)}\, du$$

which shows that $u = (x + y)$, the sum of two independent normal variates, is itself a normal variate with mean equal to the sum of the means and variance equal to the sum of the variances. In general, the sum of $k$ independent normal variates is distributed as a normal variate with mean equal to the sum of the individual means and variance equal to the sum of the individual variances.


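Example 1. If $x$ is $N(\mu, \sigma)$, what is the distribution of $x^2$?

The distribution of $x$ is

$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}\, dx$$

Let $y = x^2$ so that $dy = 2x\, dx$. The range of $y$ is from 0 to $\infty$, and that of $x$ is from $-\infty$ to $\infty$. Corresponding to a given $y$ there are two values of $x$ $(\pm\sqrt{y})$. The probability density at $+\sqrt{y}$ transforms to

$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y - 2\sqrt{y}\mu + \mu^2)/2\sigma^2}\, \frac{dy}{2\sqrt{y}}$$

and that at $-\sqrt{y}$ to

$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y + 2\sqrt{y}\mu + \mu^2)/2\sigma^2}\, \frac{dy}{2\sqrt{y}}$$

so that the total probability differential of $y$ is the sum of the above two expressions, i.e.,

$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y/2\sigma^2) - (\mu^2/2\sigma^2)}\, \frac{\left(e^{\sqrt{y}\mu/\sigma^2} + e^{-\sqrt{y}\mu/\sigma^2}\right)}{2\sqrt{y}}\, dy = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y + \mu^2)/2\sigma^2}\, \frac{\cosh(\sqrt{y}\mu/\sigma^2)}{\sqrt{y}}\, dy$$

If $\mu = 0$, the distribution of $y$ is

$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-y/2\sigma^2}\, y^{-1/2}\, dy$$

which is $G(1/2\sigma^2, \tfrac{1}{2})$ defined in 2a.5.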


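2a.5 Gamma Distribution
The gamma distribution is defined by

$$G(\alpha, p)\, dx = \frac{\alpha^p}{\Gamma(p)}\, e^{-\alpha x} x^{p-1}\, dx$$

where the range of $x$ is $(0, \infty)$, $\alpha > 0$, and $p > 0$. The $r$th raw moment of this distribution is

$$\int_0^{\infty} \frac{\alpha^p}{\Gamma(p)}\, e^{-\alpha x} x^{p+r-1}\, dx = \frac{\Gamma(p + r)}{\Gamma(p)}\, \frac{1}{\alpha^r}$$

so that

$$E(x) = \frac{p}{\alpha} \qquad V(x) = \frac{p}{\alpha^2}$$

If $\alpha = 1$, the mean and the variance of the gamma variate are equal, as in the case of a Poisson variate.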


Let $x$ and $y$ be two independent gamma variates with parameters $(\alpha, p)$ and $(\alpha, q)$; then their joint distribution is

$$\text{const. } e^{-\alpha(x + y)} x^{p-1} y^{q-1}\, dx\, dy$$

Put $x = r\cos^2\theta$ and $y = r\sin^2\theta$, so that $dx\, dy = 2r\cos\theta\sin\theta\, dr\, d\theta$. The distribution of $r$ and $\theta$ is

$$\text{const. } e^{-\alpha r} r^{p+q-1} (\cos\theta)^{2p-1} (\sin\theta)^{2q-1}\, dr\, d\theta$$

and that of $r$ alone is $c\, e^{-\alpha r} r^{p+q-1}\, dr$, which is again a gamma variate with parameters $(\alpha, p + q)$. In general the sum of $k$ independent gamma variates with parameters $(\alpha, p_1), (\alpha, p_2), \cdots, (\alpha, p_k)$ is distributed as a gamma variate with parameters $(\alpha, p_1 + p_2 + \cdots + p_k)$. The distribution of $z = x/(x + y)$ is that of $\cos^2\theta$, i.e.,

$$\frac{\Gamma(p + q)}{\Gamma(p)\Gamma(q)}\, z^{p-1}(1 - z)^{q-1}\, dz = \frac{z^{p-1}(1 - z)^{q-1}\, dz}{B(p, q)}$$

which is $B(p, q)$ defined in 2a.6. A special case of the gamma distribution is the $\chi^2$ distribution,

$$\chi^2(k) = \text{const. } e^{-\chi^2/2} (\chi^2)^{(k/2)-1}\, d\chi^2 = G\left(\tfrac{1}{2}, \tfrac{k}{2}\right)$$

which is specified by one parameter $k$ known as the degrees of freedom of $\chi^2$.


2a.6 Beta Distribution
A stochastic variable in the range (0, 1) having the probability distribution

$$B(a, b)\, dx = \frac{x^{a-1}(1 - x)^{b-1}\, dx}{B(a, b)}$$

is said to follow the beta distribution with parameters $a$ and $b$. The $r$th moment about the origin is

$$\frac{1}{B(a, b)} \int_0^1 x^{a+r-1}(1 - x)^{b-1}\, dx = \frac{B(a + r, b)}{B(a, b)}$$

The mean and variance of $x$ are

$$\frac{a}{a + b} \qquad \text{and} \qquad \frac{ab}{(a + b)^2 (a + b + 1)}$$


If $x$ and $y$ are two independent beta variates with parameters $(a, b)$ and $(c, d)$, then their joint distribution is

$$\text{const. } x^{a-1}(1 - x)^{b-1} y^{c-1}(1 - y)^{d-1}\, dx\, dy$$

Put $u = x$ and $z = xy$, so that $\partial(x, y)/\partial(u, z) = 1/u$. The joint distribution of $u$ and $z$ is

$$\text{const. } u^{a-c-d}(1 - u)^{b-1} z^{c-1}(u - z)^{d-1}\, du\, dz$$

$$= \text{const. } (1 - u)^{b-1}(u - z)^{d-1} z^{c-1}\, du\, dz \qquad \text{if } a = c + d$$

Integrating over $u$ from $z$ to 1, the distribution of $z$ is obtained as

$$\frac{z^{c-1}(1 - z)^{b+d-1}\, dz}{B(c, b + d)}$$

This shows that the product of two independent beta variables, with parameters $(a, b)$ and $(c, d)$ such that $a = (c + d)$, is distributed as a beta variable with parameters $(c, b + d)$. In general, the product of independent beta variables, with parameters $(a_1, b_1), (a_2, b_2), \cdots, (a_k, b_k)$ such that $a_i = (a_{i+1} + b_{i+1})$, is distributed as a beta variable with parameters $(a_k, b_1 + \cdots + b_k)$.


2a.7 Cauchy Distribution
The Cauchy distribution is defined by

$$\frac{1}{\pi}\, \frac{dx}{1 + (x - \mu)^2}$$

where $x$ ranges from $-\infty$ to $\infty$. This is a symmetrical distribution with the modal value at $x = \mu$.

$$E(x) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{x\, dx}{1 + (x - \mu)^2}$$

This integral does not exist but has the principal value $\mu$ since

$$\lim_{T \to \infty} \frac{1}{\pi} \int_{-T}^{T} \frac{x\, dx}{1 + (x - \mu)^2} = \mu$$

exists. The second moment

$$\mu_2 = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{(x - \mu)^2\, dx}{1 + (x - \mu)^2} = \infty$$

so that the Cauchy variable has infinite variance. This is an example of a continuous distribution for which the mean and variance do not exist.
Consider two independent Cauchy variates $x$ and $y$ with parameters $\mu_1$ and $\mu_2$. Their joint distribution is

$$\frac{1}{\pi^2}\, \frac{dx\, dy}{[1 + (x - \mu_1)^2][1 + (y - \mu_2)^2]}$$

Putting $u = x - \mu_1$ and $v = y - \mu_2$, we find the distribution of $(u + v)/2$ and hence derive that of $(x + y)/2$ by the substitution $u + v = (x + y - \mu_1 - \mu_2)$. Transforming from $u, v$ to $u, z$ connected by the relations $u = u$ and $u + v = 2z$, the joint distribution of $u$ and $z$ is given by

$$\text{const. } \frac{du\, dz}{(1 + u^2)\{1 + (2z - u)^2\}}$$

$$= \text{const. } \frac{du\, dz}{(4z^2 + 4)\, 4z^2} \left[\frac{4zu}{1 + u^2} + \frac{4z^2}{1 + u^2} + \frac{8z^2 - 4zu}{1 + (2z - u)^2} + \frac{4z^2}{1 + (2z - u)^2}\right]$$

Integrating term by term with respect to $u$ from $-\infty$ to $\infty$, we obtain

$$\text{const. } \frac{dz}{(4z^2 + 4)\, 4z^2} \Big[2z\log(1 + u^2) - 2z\log\{1 + (2z - u)^2\} + 4z^2\tan^{-1}u + 4z^2\tan^{-1}(u - 2z)\Big]_{-\infty}^{\infty} = \text{const. } \frac{dz}{1 + z^2}$$

The distribution of $\bar{z} = (x + y)/2$, obtained by changing $z$ to $\bar{z} - (\mu_1 + \mu_2)/2$ in the above expression, is

$$\frac{1}{\pi}\, \frac{d\bar{z}}{1 + \{\bar{z} - (\mu_1 + \mu_2)/2\}^2}$$

which shows that the mean of two independent Cauchy variables with parameters $\mu_1$ and $\mu_2$ is distributed as a Cauchy variable with the parameter equal to the mean of the two parameters. In general, the average of $k$ independent Cauchy variables with parameters $\mu_1, \cdots, \mu_k$ is distributed as a Cauchy variable with the parameter $(\mu_1 + \cdots + \mu_k)/k$. If all $\mu$ are the same, we obtain the interesting result that the distributions of a single observation and that of the mean of any number of observations are the same.
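
The result for the mean of two Cauchy variables can be checked by integrating the joint density numerically instead of by partial fractions. The sketch below is an illustration under stated assumptions: the substitution $u = \tan\theta$, the midpoint rule, the values of `mu1` and `mu2`, and the function names are all choices made here, not the method of the text.

```python
from math import pi, tan

def cauchy_density(x, mu=0.0):
    """Cauchy density with parameter mu."""
    return 1.0 / (pi * (1.0 + (x - mu) ** 2))

def density_of_mean(z, mu1, mu2, steps=4000):
    """Density of (x + y)/2 at z, integrating the joint density over u = x - mu1."""
    w = 2.0 * z - mu1 - mu2  # the value of u + v at this z
    h = pi / steps
    total = 0.0
    for i in range(steps):
        theta = -pi / 2.0 + (i + 0.5) * h
        u = tan(theta)  # with u = tan(theta), du / (1 + u^2) = d(theta)
        total += 1.0 / (1.0 + (w - u) ** 2)
    # The factor 2 is the Jacobian of (u, v) -> (u, z), since v = w - u and dw = 2 dz.
    return 2.0 * total * h / pi**2

# Illustrative parameters (assumed).
mu1, mu2 = 1.0, 3.0
points = [-4.0, 0.0, 2.0, 2.5, 7.0]
gaps = [abs(density_of_mean(z, mu1, mu2) - cauchy_density(z, (mu1 + mu2) / 2.0))
        for z in points]
```

Each entry of `gaps` compares the numerically integrated density with the Cauchy density centred at $(\mu_1 + \mu_2)/2$ and should be negligible.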
2a.8 Pearson's $P_\lambda$ Distribution

If a stochastic variable $x$ has the probability density $f(x)$, then the variable

$$y = \int_{-\infty}^{x} f(t)\, dt$$

which represents the probability of an observation's being less than or equal to $x$, has probability density unity. Since

$$dy = \left[\frac{d}{dx} \int_{-\infty}^{x} f(t)\, dt\right] dx = f(x)\, dx$$

$f(x)\, dx$ transforms to $dy$ with the range of $y$ equal to (0, 1).

Let $z = -2\log_e y$, in which case $dz = -(2/y)\, dy$. Since the probability differential of $y$ is $dy$, that of $z$ is

$$\frac{y}{2}\, dz = \frac{1}{2}\, e^{-z/2}\, dz$$

which is $G(\tfrac{1}{2}, 1)$.

If $y_1, y_2, \cdots, y_k$ are $k$ probabilities derived from $k$ independent observations $x_1, \cdots, x_k$ from $k$ distributions, all of which may be different from one another, then their joint distribution is

$$dy_1 \cdots dy_k$$

If $z_i = -2\log_e y_i$, then $z_i$ is $G(\tfrac{1}{2}, 1)$ for all $i$. If $P_\lambda$ is defined by $z_1 + \cdots + z_k$, then $P_\lambda$, being the sum of $k$ independent gamma variates with parameters $(\tfrac{1}{2}, 1)$, is distributed as $G(\tfrac{1}{2}, k)$ or $\chi^2(2k)$. This distribution is useful in combining several independent tests.
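
As a hedged illustration of how $P_\lambda$ might be used to combine independent tests, the sketch below computes $P_\lambda = -2\Sigma\log_e y_i$ and the probability that a $\chi^2$ with $2k$ degrees of freedom exceeds it. For an even number of degrees of freedom this tail has the closed form $e^{-x/2}\Sigma_{j<k}(x/2)^j/j!$, which is the Poisson distribution function of 2a.3. The function names and the probabilities in `ys` are hypothetical, and treating small $y_i$ as evidence against the hypotheses is an assumption of the example.

```python
from math import exp, log

def p_lambda(probabilities):
    """P_lambda = -2 * sum(log y_i) for probabilities y_i in (0, 1]."""
    return -2.0 * sum(log(y) for y in probabilities)

def chi_square_upper_tail_even(x, k):
    """P(chi-square with 2k degrees of freedom > x), for integer k >= 1."""
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= (x / 2.0) / j   # builds (x/2)^j / j! term by term
        total += term
    return exp(-x / 2.0) * total

# Hypothetical probabilities from three independent tests.
ys = [0.08, 0.20, 0.11]
statistic = p_lambda(ys)
combined = chi_square_upper_tail_even(statistic, len(ys))
# With a single probability the combined value is that probability itself.
single = chi_square_upper_tail_even(p_lambda([0.08]), 1)
```

With one test the procedure returns the original probability, which is a useful sanity check on the transformation $z = -2\log_e y$.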
2a.9 Summary of Results

Some of the important results of this section have been brought together for later use.

(i) If $n_1, \cdots, n_k$, $(\Sigma n_i = n)$, are the frequencies in $k$ mutually exclusive classes with probabilities $\pi_1, \cdots, \pi_k$, then, as shown in 2a.2,

$$E(n_i) = n\pi_i$$

$$V(l_1 n_1 + \cdots + l_k n_k) = n\{\Sigma l_i^2 \pi_i - (\Sigma l_i \pi_i)^2\}$$

$$\operatorname{cov}\{(l_1 n_1 + \cdots + l_k n_k), (m_1 n_1 + \cdots + m_k n_k)\} = n\{\Sigma l_i m_i \pi_i - (\Sigma l_i \pi_i)(\Sigma m_i \pi_i)\} \tag{2a.9.1}$$

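(ii) Defining as in 2a.4,

$$N(\mu, \sigma, x)\, dx = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}\, dx$$

$$\int_{\bar{x}} N(\mu_1, \sigma_1, x_1) \cdots N(\mu_n, \sigma_n, x_n)\, dx_1 \cdots dx_n = N(\mu, \sigma, \bar{x})\, d\bar{x} \tag{2a.9.2}$$

where $\mu = (\mu_1 + \cdots + \mu_n)/n$, $\sigma^2 = (\sigma_1^2 + \cdots + \sigma_n^2)/n^2$. The symbol $\int_{\bar{x}}$ is used to indicate that the integration is over constant values of $\bar{x}$. The distribution of $y = x^2$, where $x$ is $N(\mu, \sigma)$, is

$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y + \mu^2)/2\sigma^2}\, \frac{\cosh(\sqrt{y}\mu/\sigma^2)}{\sqrt{y}}\, dy$$

$$= \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\mu^2/2\sigma^2} e^{-y/2\sigma^2} y^{-1/2} \left[1 + \frac{1}{2!}\left(\frac{\mu^2 y}{\sigma^4}\right) + \frac{1}{4!}\left(\frac{\mu^2 y}{\sigma^4}\right)^2 + \cdots\right] dy$$

$$= e^{-\mu^2/2\sigma^2} \sum_{r=0}^{\infty} \left(\frac{\mu^2}{2\sigma^2}\right)^r \frac{1}{r!}\, G\left(\frac{1}{2\sigma^2}, r + \frac{1}{2}, y\right) dy \tag{2a.9.3}$$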
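(iii) Defining

$$G(\alpha, p, x)\, dx = \frac{\alpha^p}{\Gamma(p)}\, e^{-\alpha x} x^{p-1}\, dx$$

the results obtained in 2a.5 are

$$\int_{x = x_1 + \cdots + x_n} G(\alpha, p_1, x_1) G(\alpha, p_2, x_2) \cdots dx_1\, dx_2 \cdots dx_n = G(\alpha, p_1 + \cdots + p_n, x)\, dx \tag{2a.9.4}$$

Also

$$\int_{z = x/(x+y)} G(\alpha, p, x) G(\alpha, q, y)\, dx\, dy = B(p, q, z)\, dz \tag{2a.9.5}$$

and

$$\int_{f = x/y} G(\alpha, p, x) G(\alpha, q, y)\, dx\, dy = \frac{\Gamma(p + q)}{\Gamma(p)\Gamma(q)}\, \frac{f^{p-1}}{(1 + f)^{p+q}}\, df \tag{2a.9.6}$$

(iv) Defining as in 2a.6,

$$B(a, b, x)\, dx = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1 - x)^{b-1}\, dx$$

$$\int_{z = xy} B(a, b, x) B(c, d, y)\, dx\, dy = B(c, b + d, z)\, dz \tag{2a.9.7}$$

if $a = c + d$.

$$\int_{z = x_1 \cdots x_n} B(a_1, b_1, x_1) B(a_2, b_2, x_2) \cdots dx_1 \cdots dx_n = B(a_n, b_1 + \cdots + b_n, z)\, dz \tag{2a.9.8}$$

provided that $a_1 = a_2 + b_2$, $a_2 = a_3 + b_3$, $\cdots$.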


2b Distributions Relating to the Univariate Normal Distribution 
2b.1 Mean and Variance in Normal Samples 


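Let $x_1, \cdots, x_n$ be $n$ independent observations from a normal population $N(\mu, \sigma)$. The probability density is

$$\text{const. } e^{-\phi/2\sigma^2}$$

where

$$\phi = \Sigma(x_i - \mu)^2 = n(\bar{x} - \mu)^2 + S^2$$

$$\bar{x} = \frac{(x_1 + \cdots + x_n)}{n} \qquad S^2 = \Sigma(x_i - \bar{x})^2$$

Consider an orthogonal transformation from $x$ to $z$:

$$z_1 = \frac{x_1 + \cdots + x_n}{\sqrt{n}}$$

$$z_i = a_{i1} x_1 + \cdots + a_{in} x_n \qquad i = 2, \cdots, n$$

Since the vectors of $z_1$ and $z_i$ $(i \neq 1)$ are orthogonal, $\sum_j a_{ij} = 0$ for $i = 2, \cdots, n$. Under this transformation the point $(\mu, \mu, \cdots, \mu)$ in $x$ goes over to $(\sqrt{n}\mu, 0, \cdots, 0)$ in $z$ and, as shown in (iv) of 1c.2 (invariance of distance),

$$\Sigma(x_i - \mu)^2 = (z_1 - \sqrt{n}\mu)^2 + z_2^2 + \cdots + z_n^2 = n(\bar{x} - \mu)^2 + S^2$$

Therefore $S^2 = z_2^2 + \cdots + z_n^2$. The distribution of $z_1, \cdots, z_n$ is

$$\text{const. } e^{-[(z_1 - \sqrt{n}\mu)^2 + z_2^2 + \cdots + z_n^2]/2\sigma^2}\, dz_1 \cdots dz_n$$

which shows that the distributions of $z_1$ and $z_2, \cdots, z_n$ are independent, and hence those of $\bar{x} = z_1/\sqrt{n}$ and $S^2 = z_2^2 + \cdots + z_n^2$ are independent. The distribution of $\bar{x}$ is

$$\text{const. } e^{-n(\bar{x} - \mu)^2/2\sigma^2}\, d\bar{x} \qquad \text{const.} = \frac{1}{\sigma}\sqrt{\frac{n}{2\pi}}$$

Since $z_i^2$ is $G(1/2\sigma^2, \tfrac{1}{2})$ for $i \geq 2$, it follows from (2a.9.3) and (2a.9.4) that $z_2^2 + \cdots + z_n^2 = S^2$ is distributed as $G(1/2\sigma^2, (n-1)/2)$, i.e.,

$$\text{const. } e^{-S^2/2\sigma^2} (S^2)^{(n-3)/2}\, dS^2 \qquad \text{const.} = \frac{1}{(2\sigma^2)^{(n-1)/2}\, \Gamma\left(\dfrac{n-1}{2}\right)}$$

or that of $\chi^2 = S^2/\sigma^2$ is $\chi^2(n - 1)$.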
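
An orthogonal transformation of the kind used above can be written down explicitly. The sketch below uses one convenient choice (commonly called Helmert's transformation, an attribution added here and not made in the text) and verifies on a small set of hypothetical observations that $z_1 = \sqrt{n}\,\bar{x}$ and $z_2^2 + \cdots + z_n^2 = S^2$.

```python
from math import sqrt

def helmert_matrix(n):
    """Rows of an orthogonal matrix whose first row is (1, ..., 1)/sqrt(n)."""
    rows = [[1.0 / sqrt(n)] * n]
    for i in range(1, n):
        scale = sqrt(i * (i + 1))
        # i entries equal to 1, then -i, then zeros: the coefficients sum to zero,
        # so the row is orthogonal to (1, ..., 1) and has unit length.
        rows.append([1.0 / scale] * i + [-i / scale] + [0.0] * (n - i - 1))
    return rows

def transform(rows, x):
    """Apply the transformation z_i = a_i1 x_1 + ... + a_in x_n."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in rows]

# Hypothetical observations.
x = [4.1, 5.3, 3.8, 6.0, 5.2]
n = len(x)
z = transform(helmert_matrix(n), x)
mean = sum(x) / n
S2 = sum((xi - mean) ** 2 for xi in x)
sum_sq_rest = sum(zi * zi for zi in z[1:])
```

Any other orthogonal completion of the first row would serve equally well; the argument in the text only uses orthogonality, not the particular coefficients.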


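2b.2 Student's Distribution
The joint distribution of $\sqrt{n}(\bar{x} - \mu)$ and $S^2$ is

$$N\left(0, \sigma, \sqrt{n}(\bar{x} - \mu)\right) d\sqrt{n}(\bar{x} - \mu) \; G\left(\frac{1}{2\sigma^2}, \frac{n-1}{2}, S^2\right) dS^2$$

Student's $t$ statistic is

$$t = \frac{\sqrt{n}(\bar{x} - \mu)}{s}$$

where $s^2 = S^2/(n - 1)$ or, squaring and rearranging,

$$\frac{t^2}{n - 1} = \frac{n(\bar{x} - \mu)^2}{S^2} = f$$

$n(\bar{x} - \mu)^2$ is distributed as $G(1/2\sigma^2, \tfrac{1}{2})$, so, from (2a.9.6), the distribution of $f$ is

$$\frac{\Gamma\left(\dfrac{n}{2}\right)}{\Gamma\left(\dfrac{1}{2}\right)\Gamma\left(\dfrac{n-1}{2}\right)}\, \frac{f^{-1/2}}{(1 + f)^{n/2}}\, df$$

and hence that of $t$, which is symmetrically distributed, is

$$\frac{\Gamma\left(\dfrac{n}{2}\right)}{\sqrt{(n-1)\pi}\; \Gamma\left(\dfrac{n-1}{2}\right)}\, \frac{dt}{\left(1 + \dfrac{t^2}{n-1}\right)^{n/2}} \tag{2b.2.1}$$

This is called Student's $t$ distribution based on $(n - 1)$ degrees of freedom. When $n = 2$, this reduces to the Cauchy distribution.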
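
The density of $t$ for a sample of size $n$, namely $\Gamma(n/2)\,/\,\{\sqrt{(n-1)\pi}\;\Gamma((n-1)/2)\}$ times $(1 + t^2/(n-1))^{-n/2}$, can be coded directly, which makes it easy to confirm the Cauchy special case and to check that the total probability is unity. The function names, the integration range, and the sample size used below are illustrative assumptions.

```python
from math import gamma, pi, sqrt

def student_density(t, n):
    """Density of Student's t for sample size n, i.e. n - 1 degrees of freedom."""
    const = gamma(n / 2.0) / (sqrt((n - 1) * pi) * gamma((n - 1) / 2.0))
    return const / (1.0 + t * t / (n - 1)) ** (n / 2.0)

def cauchy_density(t):
    """Cauchy density with parameter zero."""
    return 1.0 / (pi * (1.0 + t * t))

def total_probability(n, half_width=200.0, steps=200000):
    """Midpoint-rule integral of the density over (-half_width, half_width)."""
    h = 2.0 * half_width / steps
    return sum(student_density(-half_width + (i + 0.5) * h, n)
               for i in range(steps)) * h

# n = 2 (one degree of freedom) should reproduce the Cauchy density.
cauchy_gap = max(abs(student_density(t, 2) - cauchy_density(t))
                 for t in (-3.0, 0.0, 0.5, 10.0))
# Illustrative sample size (assumed): n = 6, i.e. 5 degrees of freedom.
area_n6 = total_probability(6)
```

The finite integration range is adequate here only because the tails for 5 degrees of freedom fall off quickly; for $n = 2$ the same range would leave a visible deficit.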

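Non-Null $t$ Distribution. To find the distribution of $t^2 = n\bar{x}^2/s^2$ when $\mu \neq 0$, we note that the joint density of $S^2$ and $y = n\bar{x}^2$ is (see 2a.9.3)

$$e^{-n\mu^2/2\sigma^2} \sum_{r=0}^{\infty} \left(\frac{n\mu^2}{2\sigma^2}\right)^r \frac{1}{r!}\, G\left(\frac{1}{2\sigma^2}, r + \frac{1}{2}, y\right) G\left(\frac{1}{2\sigma^2}, \frac{n-1}{2}, S^2\right) \tag{2b.2.2}$$

Making use of (2a.9.6) for each term of the infinite series we find the distribution of $f = y/S^2$ to be

$$e^{-n\mu^2/2\sigma^2} \sum_{r=0}^{\infty} \left(\frac{n\mu^2}{2\sigma^2}\right)^r \frac{1}{r!}\, \frac{\Gamma\left(\dfrac{n}{2} + r\right)}{\Gamma\left(r + \dfrac{1}{2}\right)\Gamma\left(\dfrac{n-1}{2}\right)}\, \frac{f^{r - 1/2}}{(1 + f)^{(n/2) + r}}\, df$$

$$= e^{-n\mu^2/2\sigma^2}\, \frac{\Gamma\left(\dfrac{n}{2}\right)}{\Gamma\left(\dfrac{n-1}{2}\right)\Gamma\left(\dfrac{1}{2}\right)}\, \frac{f^{-1/2}}{(1 + f)^{n/2}}\; {}_1F_1\left(\frac{n}{2}, \frac{1}{2}, \frac{n\mu^2}{2\sigma^2}\, \frac{f}{1 + f}\right) df \tag{2b.2.3}$$

where ${}_1F_1$ is the hypergeometric function defined by the series above. The distribution of $t^2 = (n - 1)f$ can be obtained from this. Sometimes it is useful to use the distribution of $R = S^2/(y + S^2)$, which can be obtained by applying (2a.9.5) to each term in (2b.2.2).

$$e^{-n\mu^2/2\sigma^2} \sum_{r=0}^{\infty} \left(\frac{n\mu^2}{2\sigma^2}\right)^r \frac{1}{r!}\, B\left(\frac{n-1}{2}, r + \frac{1}{2}, R\right) dR \tag{2b.2.4}$$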


2b.3 Fisher’s z Distribution 


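If $s_1^2$ and $s_2^2$ are two independent estimates of variance based on $n_1$ and $n_2$ degrees of freedom, then $S_1^2 = n_1 s_1^2$ and $S_2^2 = n_2 s_2^2$ have the joint distribution

$$G\left(\frac{1}{2\sigma^2}, \frac{n_1}{2}, S_1^2\right) G\left(\frac{1}{2\sigma^2}, \frac{n_2}{2}, S_2^2\right) dS_1^2\, dS_2^2$$

so that their ratio $f = S_1^2/S_2^2$ has the distribution

$$\frac{\Gamma\left(\dfrac{n_1 + n_2}{2}\right)}{\Gamma\left(\dfrac{n_1}{2}\right)\Gamma\left(\dfrac{n_2}{2}\right)}\, \frac{f^{(n_1/2) - 1}}{(1 + f)^{(n_1 + n_2)/2}}\, df$$

The distribution of $F = s_1^2/s_2^2 = n_2 f/n_1$ is

$$\frac{\Gamma\left(\dfrac{n_1 + n_2}{2}\right)}{\Gamma\left(\dfrac{n_1}{2}\right)\Gamma\left(\dfrac{n_2}{2}\right)} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{F^{(n_1/2) - 1}}{\left(1 + \dfrac{n_1}{n_2} F\right)^{(n_1 + n_2)/2}}\, dF$$

This is called the variance ratio distribution. Fisher defines

$$z = \log_e \frac{s_1}{s_2} = \frac{1}{2}\log_e F$$

so that the $z$ distribution is

$$\frac{2\,\Gamma\left(\dfrac{n_1 + n_2}{2}\right)}{\Gamma\left(\dfrac{n_1}{2}\right)\Gamma\left(\dfrac{n_2}{2}\right)} \left(\frac{n_1}{n_2}\right)^{n_1/2} \frac{e^{n_1 z}}{\left(1 + \dfrac{n_1}{n_2}\, e^{2z}\right)^{(n_1 + n_2)/2}}\, dz$$

In practice it is convenient to use the $F$ distribution instead of $z$.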


2b.4 Cochran’s Theorem 
Let $x_1, \cdots, x_n$ be $n$ independent normal variates with zero mean and unit variance. If $\Sigma x_i^2 = q_1 + \cdots + q_k$, where $q_i$ is a quadratic form of rank $n_i$, then the necessary and sufficient condition that $q_1, q_2, \cdots$ are independently distributed as $\chi^2$ with $n_1, n_2, \cdots$ degrees of freedom is that $n = \Sigma n_i$.

Since $q_i$ is a quadratic form of rank $n_i$, it can be expressed as (see ii in 1c.2)

$$\pm l_{i1}^2 \pm l_{i2}^2 \pm \cdots \pm l_{in_i}^2 \tag{2b.4.1}$$

where $l_{ij}$ is a linear function of $x_1, \cdots, x_n$. Also by hypothesis

$$\Sigma x_i^2 = \Sigma q_i = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \pm\, l_{ij}^2$$

If $n = \Sigma n_i$, then there are $n$ linear functions $l_{ij}$ which supply a transformation from $x$ to $l$, viz.

$$l = xA$$

with $A$ as the matrix of transformation. If $\Delta$ denotes a diagonal matrix with $\pm 1$ in the diagonal, then

$$xx' = \Sigma\Sigma \pm l_{ij}^2 = l\Delta l' = xA\Delta A'x'$$

Hence $A\Delta A' = I$, so that $|A| \neq 0$. This shows that the transformation is non-singular, in which case the positive definite form $xx'$ remains positive definite even after transformation. Therefore the coefficients in (2b.4.1) should all be $+1$. The joint distribution of $l$ is derived from that of $x$,

$$\text{const. } e^{-\frac{1}{2}(x_1^2 + \cdots + x_n^2)}\, dx_1 \cdots dx_n \sim \text{const. } e^{-\frac{1}{2}\Sigma\Sigma l_{ij}^2}\, \Pi\, dl_{ij}$$

which shows that the $l_{ij}$ are independently distributed and so are $q_1, q_2, \cdots, q_k$, each of which depends on exclusive subsets of $l_{ij}$. Since $q_i$ is the sum of squares of $n_i$ independent normal variates $l_{i1}, \cdots, l_{in_i}$, it is distributed as $\chi^2$ with $n_i$ degrees of freedom. This establishes the sufficiency of the condition.

If the $q_i$ are independently distributed as $\chi^2$ with $n_i$ degrees of freedom, then $\Sigma q_i$, being the sum of $k$ independent $\chi^2$, is also a $\chi^2$ with $\Sigma n_i$ degrees of freedom. But $\Sigma q_i = \Sigma x_i^2$, being the sum of squares of $n$ variates $N(0, 1)$, is a $\chi^2$ with $n$ degrees of freedom. Therefore $n = \Sigma n_i$.


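2b.5 Distribution of Non-Central $\chi^2$

Consider $k$ independent normal variates $x_1, \cdots, x_k$ with means $\nu_1, \nu_2, \cdots, \nu_k$ and standard deviations $\sigma_1, \sigma_2, \cdots, \sigma_k$. The distribution of $(x_1/\sigma_1)^2 + (x_2/\sigma_2)^2 + \cdots$ is $\chi^2$ with $k$ degrees of freedom only when $\nu_1 = \cdots = \nu_k = 0$. To find the general distribution make the following transformations.

$$y_i = \frac{x_i}{\sigma_i} \qquad i = 1, 2, \cdots, k$$

$$z = yA$$

where $A$ is an orthogonal transformation such that the first transformed variable is

$$z_1 = \frac{1}{\nu}\left(\frac{\nu_1}{\sigma_1}\, y_1 + \cdots + \frac{\nu_k}{\sigma_k}\, y_k\right) \qquad \nu^2 = \Sigma\left(\frac{\nu_i}{\sigma_i}\right)^2$$

Then $E(z_1) = \nu$, $E(z_2) = \cdots = E(z_k) = 0$, and $z_1, \cdots, z_k$ are distributed as independent normal variates with unit variances. Now

$$\Sigma\left(\frac{x_i}{\sigma_i}\right)^2 = \Sigma y_i^2 = \Sigma z_i^2 = \chi_1^2 + \chi_2^2$$

where $\chi_1^2 = z_1^2$, and $\chi_2^2 = z_2^2 + \cdots + z_k^2$. The joint distribution of $z_1$ and $\chi_2^2$ is simply

$$\text{const. } e^{-\frac{1}{2}(z_1 - \nu)^2}\, e^{-\frac{1}{2}\chi_2^2}\, (\chi_2^2)^{(k-3)/2}\, dz_1\, d\chi_2^2$$

The distribution of $\chi_2^2$ and $\chi_1^2$ is, by (2a.9.3),

$$e^{-\nu^2/2} \sum_{r=0}^{\infty} \left(\frac{\nu^2}{2}\right)^r \frac{1}{r!}\, G\left(\frac{1}{2}, r + \frac{1}{2}, \chi_1^2\right) d\chi_1^2 \; G\left(\frac{1}{2}, \frac{k-1}{2}, \chi_2^2\right) d\chi_2^2$$

Hence, by using (2a.9.4), each term of the above series can be reduced, yielding the distribution of $\chi^2 = \chi_1^2 + \chi_2^2$,

$$e^{-\nu^2/2} \sum_{r=0}^{\infty} \left(\frac{\nu^2}{2}\right)^r \frac{1}{r!}\, G\left(\frac{1}{2}, \frac{k}{2} + r, \chi^2\right) d\chi^2$$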


2c Multivariate Normal Populations 
2c.1 The Multivariate Normal Distribution 
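(i) A set of $p$ variates $\mathbf{x} = (x_1, \cdots, x_p)$ is said to follow the $p$-variate normal distribution if the joint probability distribution of the variables is

$$C e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})\Lambda(\mathbf{x} - \boldsymbol{\mu})'}\, d\mathbf{x}$$

and, when $\mathbf{x}$ is measured from the origin $\boldsymbol{\mu} = (\mu_1, \cdots, \mu_p)$, the distribution reduces to

$$C e^{-\frac{1}{2}\mathbf{x}\Lambda\mathbf{x}'}\, d\mathbf{x}$$

where $C$ is a constant and the quadratic form $\mathbf{x}\Lambda\mathbf{x}'$ is positive definite. The constant is determined from the relation

$$C \int e^{-\frac{1}{2}\mathbf{x}\Lambda\mathbf{x}'}\, d\mathbf{x} = 1$$

Since $\mathbf{x}\Lambda\mathbf{x}'$ is positive definite, there exists a non-singular transformation $\mathbf{x} = \mathbf{y}B$ such that

$$\mathbf{x}\Lambda\mathbf{x}' = \mathbf{y}B\Lambda B'\mathbf{y}' = \mathbf{y}\mathbf{y}'$$

i.e., $B\Lambda B' = I$ or $|B| = |\Lambda|^{-1/2}$. The Jacobian of the transformation is

$$\frac{\partial(x_1, \cdots, x_p)}{\partial(y_1, \cdots, y_p)} = |B| = |\Lambda|^{-1/2}$$

The above integral transforms to

$$\frac{C}{|\Lambda|^{1/2}} \int e^{-\frac{1}{2}\mathbf{y}\mathbf{y}'}\, dy_1 \cdots dy_p = \frac{C}{|\Lambda|^{1/2}} \int e^{-\frac{1}{2}y_1^2}\, dy_1 \cdots \int e^{-\frac{1}{2}y_p^2}\, dy_p = \frac{C (2\pi)^{p/2}}{|\Lambda|^{1/2}} = 1$$

so that

$$C = \frac{|\Lambda|^{1/2}}{(2\pi)^{p/2}}$$

(ii) Since

$$\int x_i e^{-\frac{1}{2}\mathbf{x}\Lambda\mathbf{x}'}\, d\mathbf{x} = 0$$

the integrand changing sign when $\mathbf{x}$ is replaced by $-\mathbf{x}$, it follows that $\mu_i$ is the mean value of the original variable $x_i$. If $E(x_i x_j)$ is the covariance between the $i$th and $j$th variables, the expected values of the elements of the matrix $\mathbf{x}'\mathbf{x}$ will be the elements of the variance-covariance or the dispersion matrix, to be represented by $R$,

$$R = C \int \mathbf{x}'\mathbf{x}\, e^{-\frac{1}{2}\mathbf{x}\Lambda\mathbf{x}'}\, d\mathbf{x}$$

If $\mathbf{x} = \mathbf{y}B$ such that $B\Lambda B' = I$, the above integral becomes

$$C|B| \int B'\mathbf{y}'\mathbf{y}B\, e^{-\frac{1}{2}\mathbf{y}\mathbf{y}'}\, d\mathbf{y} = \frac{1}{(2\pi)^{p/2}}\, B'\left[\int \mathbf{y}'\mathbf{y}\, e^{-\frac{1}{2}\mathbf{y}\mathbf{y}'}\, d\mathbf{y}\right] B = B'IB = B'B$$

Since $B\Lambda B' = I$, $\Lambda = B^{-1}B'^{-1} = (B'B)^{-1}$ or $B'B = \Lambda^{-1}$. This shows that the dispersion matrix of the variables is the reciprocal of the matrix of the quadratic form in the exponential. Alternatively, given the dispersion matrix $R$, the probability density can be written as

$$\frac{1}{|R|^{1/2}(2\pi)^{p/2}}\, e^{-\frac{1}{2}\mathbf{x}R^{-1}\mathbf{x}'}$$

The above proof is due to Nair (1949).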


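2c.2 The Distribution of a Set of Linear Functions
of Normal Variates
Starting from the joint distribution

$$C e^{-\frac{1}{2}\mathbf{x}\Lambda\mathbf{x}'}\, d\mathbf{x}$$

it is required to find the distribution of $\mathbf{u} = (u_1, \cdots, u_k)$, $(k < p)$, defined by

$$\mathbf{u} = \mathbf{x}B$$

where $B$ is a matrix $(p, k)$ of rank $k$. Make the transformation

$$\mathbf{u} = \mathbf{x}B \qquad \mathbf{v} = \mathbf{x}A$$

where $\mathbf{v} = (v_1, \cdots, v_{p-k})$ and $A$ is a matrix $(p, p - k)$ of rank $(p - k)$ and is such that $A'\Lambda^{-1}B = 0$. The total transformation may be represented by $(\mathbf{u}\ \mathbf{v}) = \mathbf{x}(B\ A)$. Now the rank of

$$(B\ A)'\Lambda^{-1}(B\ A) = \begin{pmatrix} B'\Lambda^{-1}B & 0 \\ 0 & A'\Lambda^{-1}A \end{pmatrix}$$

is $p$, the same as that of $(B\ A)$. This means that the rank of $B'\Lambda^{-1}B$ is $k$ and that of $A'\Lambda^{-1}A$ is $(p - k)$, in which case $(B'\Lambda^{-1}B)^{-1}$ and $(A'\Lambda^{-1}A)^{-1}$ exist. Let the quadratic form $\mathbf{x}\Lambda\mathbf{x}'$ change over to

$$\mathbf{u}D_1\mathbf{u}' + \mathbf{u}D_2\mathbf{v}' + \mathbf{v}D_3\mathbf{v}'$$

Substituting $\mathbf{u} = \mathbf{x}B$ and $\mathbf{v} = \mathbf{x}A$, we obtain

$$BD_1B' + BD_2A' + AD_3A' = \Lambda$$

Pre- and post-multiplying by $B'\Lambda^{-1}$ and $\Lambda^{-1}B$,

$$B'\Lambda^{-1}BD_1B'\Lambda^{-1}B = B'\Lambda^{-1}B \qquad \text{or} \qquad D_1 = (B'\Lambda^{-1}B)^{-1}$$

Similarly, $D_3 = (A'\Lambda^{-1}A)^{-1}$. Pre- and post-multiplying by $B'\Lambda^{-1}$ and $\Lambda^{-1}A$, we find

$$B'\Lambda^{-1}BD_2A'\Lambda^{-1}A = 0 \qquad \text{or} \qquad D_2 = 0$$

The joint distribution of $\mathbf{u}$ and $\mathbf{v}$ is

$$\text{const. } e^{-\frac{1}{2}(\mathbf{u}D_1\mathbf{u}' + \mathbf{v}D_3\mathbf{v}')}\, d\mathbf{u}\, d\mathbf{v}$$

Integrating out for $\mathbf{v}$, the distribution for $\mathbf{u}$ is

$$\text{const. } e^{-\frac{1}{2}\mathbf{u}(B'\Lambda^{-1}B)^{-1}\mathbf{u}'}\, d\mathbf{u}$$

which shows that a set of linear functions of normal variates follows a multivariate normal distribution. The dispersion matrix of $\mathbf{u} = \mathbf{x}B$ is

$$\{(B'\Lambda^{-1}B)^{-1}\}^{-1} = B'\Lambda^{-1}B$$

which can be obtained directly without finding the distribution of $\mathbf{u}$.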

In particular, if z1, +--+, £p follow a p-variate normal distribution, the 
subset zı, -*+, £ follow a k-variate normal distribution. It is also of 
interest to determine the conditional distribution of £41, +++, tp, given 
£1, T2, ***, tz. Consider the partitioned vector 


(X1 | Xe) = (Er 6°, Te Tega, +, Tp) 


with its dispersion matrix 


eo- ($18) 


Consider the linear functions 
Xo —x,T 
where the matrix T is chosen such that 
, Elx (x2 — xı T)] = 0 
Le 
B=AT=0 o =AB 
E(x — xT)! (x2 — x1T) = E(x2'x2) + T’E(xy'x1)T — 28 (x2'x,)T 
= 0 + B'A™AA™B — 2B/A1B 
= C — BA1B=D say 
The joint distribution of x; and ye = (X2 — xı T) is 
const, ¢~ 14H’ yD y dy, dyg 
and that of xı and xs is 


const. e7 “1411 — Max T) D (z2x T)" dx, dx2 


which shows that the distribution of x2, given xı, is the second part of 
the above expression. The matrices T and D are already determined 
in terms of known quantities. 
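
As an illustration of how $T$ and $D$ follow from the partitioned dispersion matrix, here is a short sketch assuming NumPy; the helper `conditional_parts` and the random dispersion matrix are hypothetical, introduced only for this check. It compares $D = C - B'A^{-1}B$ with the inverse of the lower block of the matrix of the quadratic form.

```python
import numpy as np

def conditional_parts(Sigma, k):
    """Split Sigma as [[A, B], [B', C]]; return T = A^{-1}B and D = C - B'A^{-1}B."""
    A, B, C = Sigma[:k, :k], Sigma[:k, k:], Sigma[k:, k:]
    T = np.linalg.solve(A, B)
    D = C - B.T @ T
    return T, D

rng = np.random.default_rng(1)
G = rng.normal(size=(5, 5))
Sigma = G @ G.T + 5 * np.eye(5)    # arbitrary positive definite dispersion matrix
T, D = conditional_parts(Sigma, k=2)
P = np.linalg.inv(Sigma)           # matrix of the quadratic form in the density
```

The conditional dispersion `D` equals the inverse of the block `P[2:, 2:]`, and the choice of `T` makes the covariance between the first part and the residual part vanish.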




2c.3 The Distribution of Quadratic Forms 
(i) Given the joint distribution

$$C e^{-\frac{1}{2}x\Lambda x'}\, dx$$

to find the distribution of the quadratic form $x\Lambda x'$ make the transformation $y = xB$ such that

$$x\Lambda x' = yy'$$

in which case the joint distribution of $y$ is

$$\text{const. } e^{-\frac{1}{2}(y_1^2 + \cdots + y_p^2)}\, dy_1 \cdots dy_p$$

The quadratic form $x\Lambda x' = \sum y_i^2$, being the sum of squares of $p$ variates $N(0, 1)$, is distributed as $\chi^2$ with $p$ degrees of freedom.
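
One possible choice of the transformation $y = xB$ is a Cholesky factor of $\Lambda$. The sketch below, which assumes NumPy and uses an arbitrary random $\Lambda$, is only an illustration of that choice; the text does not prescribe how $B$ is found.

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.normal(size=(4, 4))
Lam = G @ G.T + 4 * np.eye(4)            # arbitrary positive definite Lambda
B = np.linalg.cholesky(Lam)              # lower triangular factor, Lam = B B'
x = rng.normal(size=4)                   # any row vector x
y = x @ B                                # transformation y = xB
cov_y = B.T @ np.linalg.inv(Lam) @ B     # dispersion matrix of y
```

With this `B` the quadratic form equals `y @ y`, and the dispersion matrix of `y` is the unit matrix, so the components are independent $N(0, 1)$ variates.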

(ii) Suppose that the $x_i$ are subject to $k$ restrictions specified by $xB = 0$, where $B$ is a matrix $(p, k)$. As in 2c.2 the transformation

$$u = xB, \qquad v = xA$$

such that $A'\Lambda^{-1}B = 0$ gives

$$x\Lambda x' = u(B'\Lambda^{-1}B)^{-1}u' + v(A'\Lambda^{-1}A)^{-1}v'$$

The value of the quadratic form $x\Lambda x'$ subject to the restriction $u = xB = 0$ is $v(A'\Lambda^{-1}A)^{-1}v'$. The distribution of $v$ for any $u$ is independent of $u$ and is given by

$$\text{const. } e^{-\frac{1}{2}vD_3v'}\, dv, \qquad D_3 = (A'\Lambda^{-1}A)^{-1}$$

By the result proved in (i), $vD_3v'$ is distributed as $\chi^2$ with $(p-k)$ degrees of freedom. So the quadratic form $x\Lambda x'$ subject to the condition $xB = 0$ is distributed as $\chi^2$ with $(p-k)$ degrees of freedom.

(iii) If $x\Lambda x' = q_1 + \cdots + q_k$ where $q_i$ is a quadratic form of rank $p_i$, then the necessary and sufficient condition that the $q_i$ are independently distributed as $\chi^2$ is that $p = \sum p_i$. This is essentially Cochran's theorem with a general positive definite quadratic form $x\Lambda x'$ instead of $\sum x_i^2$. The proof remains the same.

(iv) The necessary and sufficient condition that a quadratic form $xDx'$ of rank $k$ is distributed as $\chi^2$ when $x$ follows the law, const. $\exp\{-\frac{1}{2}(x\Lambda x')\}\, dx$, is that the quadratic form $x(\Lambda - D)x'$ is of rank $(p-k)$. The sufficiency condition follows from result (iii), since $x\Lambda x' = xDx' + x(\Lambda - D)x'$. To prove the necessity we observe that there exists a transformation $y = xB$ which transforms (see 1c.5)

$$x\Lambda x' \to y_1^2 + \cdots + y_p^2$$
$$xDx' \to \lambda_1y_1^2 + \cdots + \lambda_ky_k^2$$




The quadratic form $xDx'$ is thus a linear compound of the squares of independent normal variates $N(0, 1)$. It may be verified that its distribution is $\chi^2$ when and only when each $\lambda$ is equal to unity. Otherwise the distribution is different from $\chi^2$. Take, for instance, $\lambda_1y_1^2 + \lambda_2y_2^2$. If this is distributed as $\chi^2$ with 2 degrees of freedom, then

$$E(\lambda_1y_1^2 + \lambda_2y_2^2) = \lambda_1 + \lambda_2 = 2$$

$$V(\lambda_1y_1^2 + \lambda_2y_2^2) = 2(\lambda_1 + \lambda_2)^2 - 4\lambda_1\lambda_2 = 2 \times 4 - 4\lambda_1\lambda_2 = 4$$

$$\lambda_1\lambda_2 = 1$$

which means $\lambda_1 = \lambda_2 = 1$. Similarly, it may be shown by considering $k$ moments that $\lambda_1, \ldots, \lambda_k$ are the roots of an equation $(\lambda - 1)^k = 0$. Therefore $xDx'$ transforms to $y_1^2 + \cdots + y_k^2$, in which case $x(\Lambda - D)x'$ transforms to $y_{k+1}^2 + \cdots + y_p^2$ or the rank of $x(\Lambda - D)x'$ is $(p-k)$.

(v) Let $xD_1x'$ and $xD_2x'$ be two quadratic forms distributed as $\chi^2$ with $k_1$ and $k_2$ degrees of freedom. The necessary and sufficient condition that they are independently distributed is $D_1\Lambda^{-1}D_2 = 0$.

In (iv) it is seen that the transformation $y = xB$ transforms

$$xDx' \to y_1^2 + \cdots + y_k^2$$
$$x(\Lambda - D)x' \to y_{k+1}^2 + \cdots + y_p^2$$

If $z_1 = (y_1, \ldots, y_k)$ and $z_2 = (y_{k+1}, \ldots, y_p)$, then

$$(z_1 \mid z_2) = x(B_1 \mid B_2) = xB$$
$$xDx' = z_1z_1' = xB_1B_1'x' \quad \text{or} \quad D = B_1B_1'$$
$$x(\Lambda - D)x' = z_2z_2' = xB_2B_2'x' \quad \text{or} \quad \Lambda - D = B_2B_2'$$

Since $z_1$ and $z_2$ are independently distributed,

$$B_1'\Lambda^{-1}B_2 = 0$$

which gives

$$B_1B_1'\Lambda^{-1}B_2B_2' = 0 \quad \text{or} \quad D\Lambda^{-1}(\Lambda - D) = 0$$
$$D = D\Lambda^{-1}D$$

This is another form of the necessary and sufficient * condition for $xDx'$ to be distributed as $\chi^2$.

* It is not difficult to prove sufficiency because $D = D\Lambda^{-1}D$ means that $D\Lambda^{-1}(\Lambda - D) = 0$. From example 5 in 1c.5 it follows that $D$ and $(\Lambda - D)$ have independent row vectors. The rank of the vectors in $D$ and $(\Lambda - D)$ put together is obviously $p$. Therefore rank $D$ + rank $(\Lambda - D) = p$. Then (iii) holds good.




Since $xD_1x'$ is a $\chi^2$, $D_1 = D_1\Lambda^{-1}D_1$. For the same reason $D_2\Lambda^{-1}D_2 = D_2$. If $D_1\Lambda^{-1}D_2 = 0$, then rank $D_1$ + rank $D_2$ + rank $(\Lambda - D_1 - D_2) = p$. This shows, by (iii), that $xD_1x'$ and $xD_2x'$ are independently distributed. On the other hand, if $xD_1x'$ and $xD_2x'$ are independently distributed, then $x(\Lambda - D_1 - D_2)x'$ is also an independent $\chi^2$. Hence there exists a transformation which transforms the quadratic forms $xD_1x'$, $xD_2x'$, $x(\Lambda - D_1 - D_2)x'$ into sums of squares of independent variables. Proceeding as above it is seen that $D_1\Lambda^{-1}D_2 = 0$.
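
The matrix conditions in (iv) and (v) lend themselves to a direct numerical check. The following sketch assumes NumPy; the helper names `is_chi_square_form` and `are_independent` are hypothetical, and the forms $D_1$, $D_2$ are constructed from a Cholesky factor of an arbitrary $\Lambda$ purely for illustration.

```python
import numpy as np

def is_chi_square_form(D, Lam):
    """Check the condition D = D Lam^{-1} D of result (iv)."""
    return np.allclose(D @ np.linalg.solve(Lam, D), D)

def are_independent(D1, D2, Lam):
    """Check the condition D1 Lam^{-1} D2 = 0 of result (v)."""
    return np.allclose(D1 @ np.linalg.solve(Lam, D2), 0.0, atol=1e-9)

rng = np.random.default_rng(3)
p, k1, k2 = 6, 2, 3
G = rng.normal(size=(p, p))
Lam = G @ G.T + p * np.eye(p)
B = np.linalg.cholesky(Lam)              # Lam = B B', so y = xB has unit dispersion
B1, B2 = B[:, :k1], B[:, k1:k1 + k2]     # two disjoint sets of columns of B
D1, D2 = B1 @ B1.T, B2 @ B2.T            # forms equal to sums of squares of the y's
```

Here `D1` and `D2` satisfy both conditions, the rank of `Lam - D1` is `p - k1`, and a rescaled form such as `2 * D1` fails the test of (iv).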

(vi) The necessary and sufficient condition that a linear function $lx'$ is distributed independently of a quadratic form $xDx'$ which is a $\chi^2$ is that

$$l\Lambda^{-1}D = 0$$

This follows from (iv), since the quadratic forms $xl'lx'$ and $xDx'$ are independently distributed:

$$l'l\Lambda^{-1}D = 0 \quad \text{or} \quad l\Lambda^{-1}D = 0$$


(vii) The necessary and sufficient conditions in (v) and (vi) are true under more general conditions. For instance, it is not necessary to assume that the quadratic forms are distributed as $\chi^2$. This has been assumed to obtain simpler proofs of the results. This assumption is not stringent because by using (iv) we can always test whether some given quadratic forms are distributed as $\chi^2$ or not. If they are, then questions of independence arise; otherwise the results are not important.

(viii) The distribution of the quadratic form $x\Lambda x'$ when $x$ is distributed as

$$\text{const. } e^{-\frac{1}{2}(x - \mu)\Lambda(x - \mu)'}\, dx$$

can be obtained by making the transformation

$$x = yA$$

such that

$$x\Lambda x' = yA\Lambda A'y' = yy'$$

$$E(x) = E(y)A \quad \text{or} \quad E(y) = \mu A^{-1}$$

The distribution of $\sum y_i^2$ is the non-central $\chi^2$ of 2b.5 with the value of the parameter

$$\sum [E(y_i)]^2 = \mu A^{-1}(A^{-1})'\mu' = \mu\Lambda\mu'$$

and $k$ = the number of variables.




2d Least Squares Fundamental in Distribution Theory 


2d.1 Two Theorems on Least Squares 


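Suppose $y_1, y_2, \ldots, y_n$ are $n$ independent normal variates with the same variance $\sigma^2$ and

$$E(y_i) = a_{i1}\tau_1 + \cdots + a_{ik}\tau_k, \qquad i = 1, 2, \ldots, n$$

where $a_{ij}$ are elements of a specified matrix $A$ and $\tau_1, \tau_2, \ldots, \tau_k$ are unknown parameters.

(i) If $R_0^2$ is the minimum value of

$$\sum (y_i - a_{i1}\tau_1 - \cdots - a_{ik}\tau_k)^2$$

when minimized with respect to $\tau_1, \ldots, \tau_k$, what is the distribution of $R_0^2$?

Let there be $r$ independent vectors in the set

$$\alpha_i = (a_{1i}, a_{2i}, \ldots, a_{ni}), \qquad i = 1, \ldots, k$$

Then there exist $(n-r)$ vectors $\beta_1, \ldots, \beta_{n-r}$ all orthogonal to $\alpha_1, \ldots, \alpha_k$, so that $\alpha_i \cdot \beta_j = 0$. The $\beta$ vectors themselves can be chosen to satisfy the conditions $\beta_i \cdot \beta_i = 1$ and $\beta_i \cdot \beta_j = 0$. They are the vectors of the deficiency matrix considered in 1a.4. The vector $y$ can be expressed as a linear function of $\alpha$ and $\beta$ (example 3 in 1a.4).

$$y = c_1\alpha_1 + \cdots + c_k\alpha_k + d_1\beta_1 + \cdots + d_{n-r}\beta_{n-r}$$
$$E(y) = \tau_1\alpha_1 + \cdots + \tau_k\alpha_k$$

Multiplying by $\beta_i$ we find

$$\beta_i \cdot y = d_i \quad \text{and} \quad E(d_i) = \beta_i \cdot E(y) = 0$$

Also

$$V(d_i) = \beta_i \cdot \beta_i\,\sigma^2 = \sigma^2, \qquad \operatorname{cov}(d_i d_j) = \beta_i \cdot \beta_j\,\sigma^2 = 0$$

Hence $d_1, \ldots, d_{n-r}$ are all distributed independently in $N(0, \sigma^2)$, in which case

$$\frac{d_1^2 + \cdots + d_{n-r}^2}{\sigma^2}$$

is distributed as $\chi^2$ with $(n-r)$ degrees of freedom.

If $c = (c_1, \ldots, c_k)$, then

$$[y - E(y)]^2 = (c - \tau)A'A(c - \tau)' + d_1^2 + \cdots + d_{n-r}^2$$

which attains the minimum value

$$R_0^2 = d_1^2 + \cdots + d_{n-r}^2$$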




when $(c - \tau) = 0$. Hence $R_0^2/\sigma^2$ is a $\chi^2$ with $(n-r)$ degrees of freedom.*
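
The identity between the residual sum of squares and the sum of squares of the $d$ coefficients can be illustrated as follows. This is a sketch assuming NumPy; the function names are hypothetical, and the orthonormal $\beta$ vectors are taken from a singular value decomposition, which is one convenient way (not necessarily the book's) of obtaining them.

```python
import numpy as np

def residual_ss(A, y):
    """Minimum of |y - A tau|^2; lstsq copes with a rank-deficient A."""
    tau, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ tau
    return float(resid @ resid)

def residual_ss_via_beta(A, y):
    """Sum of d_i^2, where d_i = beta_i . y for orthonormal beta orthogonal to A."""
    U, s, _ = np.linalg.svd(A, full_matrices=True)
    r = int(np.sum(s > 1e-10 * s.max()))     # rank of A
    beta = U[:, r:]                          # n x (n - r), orthonormal columns
    d = beta.T @ y
    return float(d @ d), d.size

rng = np.random.default_rng(4)
n = 8
a1, a2 = rng.normal(size=n), rng.normal(size=n)
A = np.column_stack([a1, a2, a1 + a2])       # k = 3 columns but rank r = 2
y = rng.normal(size=n)
R0_direct = residual_ss(A, y)
R0_beta, dof = residual_ss_via_beta(A, y)
```

Both routes give the same $R_0^2$, and the number of $d$ coefficients is $n - r = 6$ even though $k = 3$.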
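(ii) What is the distribution of

$$R_1^2 = \text{minimum of } \sum (y_i - a_{i1}\tau_1 - \cdots - a_{ik}\tau_k)^2$$

when minimized with respect to $\tau_1, \ldots, \tau_k$ subject to $s$ independent conditions

$$\begin{aligned} f_1 \cdot \tau &= f_{11}\tau_1 + \cdots + f_{1k}\tau_k = g_1 \\ f_2 \cdot \tau &= f_{21}\tau_1 + \cdots + f_{2k}\tau_k = g_2 \\ &\;\;\vdots \\ f_s \cdot \tau &= f_{s1}\tau_1 + \cdots + f_{sk}\tau_k = g_s \end{aligned} \tag{2d.1.1}$$

Starting with the representation

$$y = c_1\alpha_1 + \cdots + c_k\alpha_k + d_1\beta_1 + \cdots + d_{n-r}\beta_{n-r}$$

and multiplying by $\alpha_1, \ldots, \alpha_k$, we obtain

$$\begin{aligned} \alpha_1 \cdot y &= c_1\,\alpha_1 \cdot \alpha_1 + \cdots + c_k\,\alpha_1 \cdot \alpha_k \\ &\;\;\vdots \\ \alpha_k \cdot y &= c_1\,\alpha_k \cdot \alpha_1 + \cdots + c_k\,\alpha_k \cdot \alpha_k \end{aligned}$$

which may be written in the matrix notation

$$yA = cA'A$$

If there exists a vector $l$ such that $lA'A = p$, then

$$E(c \cdot p) = E(cA'Al') = E(yAl') = E(y)Al' = \tau A'Al' = \tau p' = \tau \cdot p$$
$$V(c \cdot p) = V(cA'Al') = V(yAl') = lA'Al'\sigma^2 = p \cdot l\,\sigma^2$$

If $l_1$ and $l_2$ are two vectors such that $l_1A'A = p_1$ and $l_2A'A = p_2$, then

$$\operatorname{cov}(c \cdot p_1, c \cdot p_2) = p_1 \cdot l_2\,\sigma^2 = l_1 \cdot p_2\,\sigma^2$$

Let $l_1, \ldots, l_t$ be such that

$$l_1A'A = p_1, \quad \ldots, \quad l_tA'A = p_t$$

and

$$z_1 = c \cdot p_1 - p_1 \cdot \tau, \quad \ldots, \quad z_t = c \cdot p_t - p_t \cdot \tau$$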
* From this it follows that the expected value of the residual sum of squares is $(n-r)\sigma^2$. To prove this it is not necessary to assume that the variables are normally distributed. The result follows from the fact $V(d_i) = E(d_i^2) = \sigma^2$.




The dispersion matrix of $z$ is

$$D\sigma^2 = \begin{pmatrix} p_1 \cdot l_1 & \cdots & p_1 \cdot l_t \\ \vdots & & \vdots \\ p_t \cdot l_1 & \cdots & p_t \cdot l_t \end{pmatrix}\sigma^2$$

in which case

$$\frac{zD^{-1}z'}{\sigma^2}$$

is distributed as $\chi^2$ with $t$ degrees of freedom. Also

$$\operatorname{cov}(d_i, c \cdot p_j) = (\beta_i \cdot l_jA')\sigma^2 = 0$$

so that the $d$ and $z$ are uncorrelated. Hence

$$\frac{zD^{-1}z' + d_1^2 + \cdots + d_{n-r}^2}{\sigma^2}$$

is distributed as the sum of two independent $\chi^2$'s with $t$ and $(n-r)$ degrees of freedom which is the same as $\chi^2$ with $(t + n - r)$ degrees of freedom.

To minimize $(c - \tau)A'A(c - \tau)'$ subject to the conditions (2d.1.1), we observe that the conditions (2d.1.1) could be replaced by an equivalent set of $s$ independent linear combinations

$$\begin{aligned} p_1 \cdot \tau &= p_{11}\tau_1 + \cdots + p_{1k}\tau_k = g_1 \\ &\;\;\vdots \\ p_t \cdot \tau &= p_{t1}\tau_1 + \cdots + p_{tk}\tau_k = g_t \\ p_{t+1} \cdot \tau &= p_{t+1,1}\tau_1 + \cdots + p_{t+1,k}\tau_k = g_{t+1} \\ &\;\;\vdots \\ p_s \cdot \tau &= p_{s1}\tau_1 + \cdots + p_{sk}\tau_k = g_s \end{aligned} \tag{2d.1.2}$$

such that the vectors $p_1, \ldots, p_t$ lie in the space of vectors in the matrix $A'A$ and no linear combination of $p_{t+1}, \ldots, p_s$ lies in the space of $A'A$ (example 3 in 1b.1). Let $(s - t)$ be the rank of the matrix

$$(\delta_i \cdot f_j)$$

where $\delta_1, \delta_2, \ldots$ are vectors orthogonal to those in $A'A$ which are the same as those orthogonal to row vectors in $A$. The vectors $f_1, \ldots, f_s$ belong to the restraining conditions (2d.1.1).

The number of vectors $p_1, \ldots, p_t$ is obtained from the rule, $t = s$ minus the rank of $(\delta_i \cdot f_j)$.




It may be observed that (2d.1.1) is being replaced by (2d.1.2) for 
proving a result and not necessarily for convenience in determining the 
residual sum of squares which may be obtained in any way.

Using Lagrangian multipliers, we consider the function

$$(c - \tau)A'A(c - \tau)' + 2[\lambda_1(p_1 \cdot \tau - g_1) + \cdots + \lambda_s(p_s \cdot \tau - g_s)]$$

The minimizing equations are

$$(c - \tau)A'A - \lambda_1p_1 - \cdots - \lambda_sp_s = 0$$

or

$$(c - \tau)A'A - \lambda_1p_1 - \cdots - \lambda_tp_t = \lambda_{t+1}p_{t+1} + \cdots + \lambda_sp_s$$

This shows that there exists a linear combination of $p_{t+1}, \ldots, p_s$ which can be expressed in terms of the vectors in $A'A$, unless $\lambda_{t+1} = \lambda_{t+2} = \cdots = \lambda_s = 0$. Multiplying the minimizing equation by $(c - \tau)'$, we find the optimum value of $(c - \tau)A'A(c - \tau)'$ to be $(c - \tau) \cdot (\lambda_1p_1 + \cdots + \lambda_tp_t)$. Also multiplying the minimizing equation by $l_1', \ldots, l_t'$, defined earlier, we obtain

$$\begin{aligned} (c - \tau)A'Al_1' = z_1 &= \lambda_1\,p_1 \cdot l_1 + \cdots + \lambda_t\,p_t \cdot l_1 \\ &\;\;\vdots \\ (c - \tau)A'Al_t' = z_t &= \lambda_1\,p_1 \cdot l_t + \cdots + \lambda_t\,p_t \cdot l_t \end{aligned}$$

which yields the solution

$$(\lambda_1, \ldots, \lambda_t) = zD^{-1}$$

so that the minimum value of $(c - \tau)A'A(c - \tau)'$ is $zD^{-1}z'$. The minimum value of $[y - E(y)]^2$ subject to the conditions (2d.1.1) is then

$$R_1^2 = zD^{-1}z' + d_1^2 + \cdots + d_{n-r}^2$$

and it is already shown that $R_1^2/\sigma^2$ is a $\chi^2$ with $(n - r + t)$ degrees of freedom, where $t$ is the number of vectors in the $f$ space depending on the column vectors in the matrix $A$.

It also follows that the difference $R_1^2 - R_0^2$ between the conditional and the unconditional minima is distributed as $\sigma^2\chi^2$ with $t$ degrees of freedom.
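
The relation $R_1^2 - R_0^2 = zD^{-1}z'$ can be checked numerically in the simplest setting. The sketch below assumes NumPy and, as a simplifying assumption not made in the text, a matrix $A$ of full column rank so that every restriction lies in the space of $A'A$ and $t = s$; the function `constrained_fit` and the matrix `F` of restriction vectors are hypothetical names.

```python
import numpy as np

def constrained_fit(A, y, F, g):
    """Minimize |y - A tau|^2 subject to F tau = g via the Lagrangian equations."""
    k, s = A.shape[1], F.shape[0]
    K = np.block([[A.T @ A, F.T], [F, np.zeros((s, s))]])
    rhs = np.concatenate([A.T @ y, g])
    return np.linalg.solve(K, rhs)[:k]       # drop the multipliers

rng = np.random.default_rng(5)
A = rng.normal(size=(12, 4))                 # full column rank assumed (r = k = 4)
y = rng.normal(size=12)
F = rng.normal(size=(2, 4))                  # rows play the role of f_1, f_2
g = rng.normal(size=2)

c = np.linalg.solve(A.T @ A, A.T @ y)        # unconditional solution
R0 = float(np.sum((y - A @ c) ** 2))
tau = constrained_fit(A, y, F, g)
R1 = float(np.sum((y - A @ tau) ** 2))

z = F @ c - g                                # z_i = c . p_i - g_i with p_i = f_i
D = F @ np.linalg.solve(A.T @ A, F.T)        # D_ij = p_i . l_j, l_j = p_j (A'A)^{-1}
extra = float(z @ np.linalg.solve(D, z))     # z D^{-1} z'
```

In this sketch `R1 - R0` agrees with `extra`, which is the quantity distributed as $\sigma^2\chi^2$ with $t = 2$ degrees of freedom when the restrictions are true.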
Suppose that $R_2^2$ is the minimum sum of squares when some more restrictive conditions are given. Using all the conditions, let $u$ be the extra number of independent vectors which can be expressed as linear combinations of $\alpha$; then by the above argument

$$R_2^2 = \sum_{i,j=1}^{t+u} \lambda^{ij}z_iz_j + d_1^2 + \cdots + d_{n-r}^2$$

where $(1/\sigma^2)(\lambda^{ij})$ is inverse to the variance-covariance matrix $(\lambda_{ij})\sigma^2$ for $z_1, \ldots, z_{t+u}$. This shows that $R_2^2/\sigma^2$ is distributed as $\chi^2$ with $(n - r + t + u)$ degrees of freedom. The difference

$$R_2^2 - R_1^2 = \sum_{i,j=1}^{t+u} \lambda^{ij}z_iz_j - \sum_{i,j=1}^{t} d^{ij}z_iz_j$$

is distributed as $\sigma^2\chi^2$ with $u$ degrees of freedom. (See 2c.2 and 2c.3 where the $\chi^2$ is split into two independent components, $x\Lambda x' = uD_1u' + vD_3v'$. One of the parts may be identified with $\sum_{i,j=1}^{t} d^{ij}z_iz_j$.)


J 
Consider the special case where $n$ sets of $(p + 1)$ variates $x_1, \ldots, x_{p+1}$ are such that the conditioned expectation and variance of $x_{p+1}$ are as follows.

$$E(x_{(p+1)r}) = \alpha + \beta_1x_{1r} + \cdots + \beta_px_{pr}, \qquad r = 1, 2, \ldots, n$$
$$V(x_{p+1}) = \sigma^2$$

The variate $x_{p+1}$ for given $x_1, \ldots, x_p$ is considered to be normally distributed in the following examples. The values of $x_1, \ldots, x_p$ are taken as fixed quantities.


Example 1. The minimizing equations for $\beta$ coefficients are (writing $b$ for $\beta$)

$$b_1S_{1i} + b_2S_{2i} + \cdots + b_pS_{pi} = S_{(p+1)i}, \qquad i = 1, \ldots, p$$

where $S_{ij}$ is the corrected sum of products for the $i$th and $j$th variables.

Example 2. If $(S^{ij})_p$ is the matrix reciprocal to $(S_{ij})_p$, $(i, j = 1, \ldots, p)$, then the dispersion matrix of $b_1, \ldots, b_p$ is $\sigma^2(S^{ij})_p$. [Hint: $V(S_{(p+1)i}) = S_{ii}\sigma^2$ and $\operatorname{cov}(S_{(p+1)i}, S_{(p+1)j}) = S_{ij}\sigma^2$.]

Example 3. The unconditional residual sum of squares is

$$S_{(p+1)(p+1)} - b_1S_{(p+1)1} - \cdots - b_pS_{(p+1)p} = \frac{|S_{ij}|_{p+1}}{|S_{ij}|_p}$$
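
Examples 1 and 3 can be verified on arbitrary data. The following sketch assumes NumPy; the helper `corrected_sp` and the simulated data matrix are illustrative only. The last column of `X` plays the role of $x_{p+1}$.

```python
import numpy as np

def corrected_sp(X):
    """Matrix of corrected sums of squares and products S_ij."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc

rng = np.random.default_rng(6)
n, p = 20, 3
X = rng.normal(size=(n, p + 1))              # columns x_1, ..., x_p, x_{p+1}
S = corrected_sp(X)

b = np.linalg.solve(S[:p, :p], S[:p, p])     # minimizing equations of example 1
rss_formula = S[p, p] - b @ S[:p, p]         # left side of example 3
rss_det = np.linalg.det(S) / np.linalg.det(S[:p, :p])   # ratio of determinants

# Direct least squares with a constant term, for comparison.
design = np.column_stack([np.ones(n), X[:, :p]])
coef, *_ = np.linalg.lstsq(design, X[:, p], rcond=None)
rss_direct = float(np.sum((X[:, p] - design @ coef) ** 2))
```

The three values of the residual sum of squares coincide up to rounding, and the regression coefficients from the direct fit agree with `b`.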
Example 4. If $\alpha = 0$, the minimum sum of squares is

$$\frac{|S_{ij} + n\bar{x}_i\bar{x}_j|_{p+1}}{|S_{ij} + n\bar{x}_i\bar{x}_j|_p}$$

where $\bar{x}_i$ is the average for the $i$th variable.

Example 5. The statistic

$$R_{p+1} = \frac{|S_{ij} + n\bar{x}_i\bar{x}_j|_p}{|S_{ij}|_p} \div \frac{|S_{ij} + n\bar{x}_i\bar{x}_j|_{p+1}}{|S_{ij}|_{p+1}}$$

has the distribution

$$B\left(\frac{n-p-1}{2}, \frac{1}{2}\right) dR_{p+1}$$

when $\alpha = 0$.

[Hint: $R_{p+1} = \chi_1^2/(\chi_1^2 + \chi_2^2)$ where $\chi_1^2$ and $\chi_2^2$ have $(n - p - 1)$ and 1 degree of freedom, $\sigma^2\chi_1^2$ being the minimum sum of squares, and $\sigma^2\chi_2^2$ the additional value when $\alpha = 0$. Hence by (2a.9.5).]

Example 6. The joint distribution of $b_1, b_2, \ldots, b_p$ is

$$\frac{|S_{ij}|_p^{1/2}}{(2\pi\sigma^2)^{p/2}}\, e^{-\frac{1}{2\sigma^2}\sum\sum (b_i - \beta_i)(b_j - \beta_j)S_{ij}}\, db_1 \cdots db_p$$

(Hint: $b_i$ are linear functions of $x_{p+1}$, and their dispersion matrix is as given in example 2.)
Example 7. The distribution of 


$$B^2 = \sum\sum b_ib_jS_{ij}$$


is that of non-central x? (viii in 2c.3) 


an 28 2s 
Gonstie TABI > fale le 
81278 I(s + p/2) 
where

$$\beta^2 = \frac{\sum\sum \beta_i\beta_jS_{ij}}{\sigma^2}$$


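Example 8. Defining multiple correlation

$$R^2 = \frac{B^2}{B^2 + W^2}$$

where $W^2$ is the residual sum of squares which has the distribution $G[1/2\sigma^2, (n - p - 1)/2]$, show that the distribution of $R$ is

$$\frac{2e^{-\beta^2/2}}{B\left(\frac{p}{2}, \frac{n-p-1}{2}\right)}\, R^{p-1}(1 - R^2)^{(n-p-3)/2}\; {}_1F_1\left(\frac{n-1}{2}, \frac{p}{2}, \frac{\beta^2R^2}{2}\right) dR$$

[Hint: Write the joint distribution of $B^2$ and $W^2$ and apply (2a.9.5) to each term.]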
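Example 9. The joint distribution of $\bar{x}_{p+1}$, $S_{(p+1)1}, \ldots, S_{(p+1)p}$, $S_{(p+1)(p+1)}$ can be derived as follows.

The distribution of $\bar{x}_{p+1}$, $b_1, \ldots, b_p$, and $W^2$ is the product of the distributions of $\bar{x}_{p+1}$; $b_1, b_2, \ldots, b_p$; and $W^2$.

$$\frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma}\, e^{-\psi_1/2\sigma^2}\, d\bar{x}_{p+1}$$

$$\frac{|S_{ij}|_p^{1/2}}{(2\pi\sigma^2)^{p/2}}\, e^{-\psi_2/2\sigma^2}\, db_1 \cdots db_p$$

$$\frac{e^{-\psi_3/2\sigma^2}\,(W^2)^{(n-p-3)/2}\, dW^2}{(2\sigma^2)^{(n-p-1)/2}\,\Gamma\left(\frac{n-p-1}{2}\right)}$$

where

$$\psi_1 = n(\bar{x}_{p+1} - \alpha - \beta_1\bar{x}_1 - \cdots - \beta_p\bar{x}_p)^2$$
$$\psi_2 = \sum\sum S_{ij}(b_i - \beta_i)(b_j - \beta_j)$$
$$\psi_3 = W^2 = S_{(p+1)(p+1)} - b_1S_{(p+1)1} - \cdots - b_pS_{(p+1)p} = \frac{|S_{ij}|_{p+1}}{|S_{ij}|_p}$$

all adding up to

$$\psi = \sum_r (x_{(p+1)r} - \alpha - \beta_1x_{1r} - \cdots - \beta_px_{pr})^2$$

The connecting relation between $b$ and $S_{(p+1)i}$ is

$$S_{(p+1)i} = b_1S_{1i} + \cdots + b_pS_{pi}$$

Therefore

$$\frac{D(S_{(p+1)1}, S_{(p+1)2}, \ldots, S_{(p+1)p})}{D(b_1, \ldots, b_p)} = |S_{ij}|_p$$

Also

$$\frac{\partial W^2}{\partial S_{(p+1)(p+1)}} = 1$$

so that

$$d\bar{x}_{p+1}\, db_1 \cdots db_p\, dW^2 = \frac{1}{|S_{ij}|_p}\, d\bar{x}_{p+1}\, dS_{(p+1)1} \cdots dS_{(p+1)(p+1)}$$

The joint distribution can be written

$$\text{const. } e^{-\frac{1}{2\sigma^2}[\psi_1 + \psi_2 + \psi_3]} \left\{\frac{|S_{ij}|_{p+1}}{|S_{ij}|_p}\right\}^{(n-p-3)/2} \frac{|S_{ij}|_p^{1/2}}{|S_{ij}|_p}\, d\bar{x}_{p+1}\, dS_{(p+1)1} \cdots dS_{(p+1)(p+1)}$$

$$\text{const.} = \frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma} \cdot \frac{1}{(2\pi\sigma^2)^{p/2}} \cdot \frac{1}{(2\sigma^2)^{(n-p-1)/2}\,\Gamma\left(\frac{n-p-1}{2}\right)}$$

where $\psi_1$, $\psi_2$, $\psi_3$ can now be expressed in terms of the variables occurring in the differentials.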
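Example 10. Since the above distribution could be obtained by direct integration, it follows on omitting the exponentials

$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \int dx_{(p+1)1}\, dx_{(p+1)2} \cdots dx_{(p+1)n} = \text{const.} \left\{\frac{|S_{ij}|_{p+1}}{|S_{ij}|_p}\right\}^{(n-p-3)/2} \frac{1}{|S_{ij}|_p^{1/2}}\, d\bar{x}_{p+1}\, dS_{(p+1)1} \cdots dS_{(p+1)(p+1)}$$

the integral being taken over the region in which $\bar{x}_{p+1}$, $S_{(p+1)1}, \ldots, S_{(p+1)(p+1)}$ are fixed, where the value of the constant is the same as in example 9 and does not involve any $S_{ij}$.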


2d.2 Multivariate Distributions 
In examples 1 to 10 of 2d.1, what has been considered is only the relative distribution of $x_{p+1}$, given $x_1, \ldots, x_p$, so that the distributions obtained are all relative distributions for fixed values of $x_1, \ldots, x_p$. If these distributions are multiplied by the joint distribution of $x_1, \ldots, x_p$ and integrated for these variables, then unconditional distributions are obtained. We shall assume that $x_1, x_2, \ldots, x_p$ follow a $p$-variate normal distribution.
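Multiple Correlation Distribution. For instance, in example 8, the quantity

$$\beta^2 = \frac{\sum\sum \beta_i\beta_jS_{ij}}{\sigma^2}$$

occurring in the distribution is a random variable if $x_1, \ldots, x_p$ are not fixed. Consider the variable $z = \beta_1x_1 + \cdots + \beta_px_p$, which is normally distributed, being a linear function of $x_1, \ldots, x_p$.

$$V(z) = \sum\sum \beta_i\beta_j\sigma_{ij} = \Sigma^2 \text{ (say)}$$

where $\sigma_{ij}$ is the covariance between $x_i$ and $x_j$. Since $\sigma^2\beta^2 = \sum_{r=1}^{n} (z_r - \bar{z})^2$, where $z_r = \beta_1x_{1r} + \cdots + \beta_px_{pr}$ corresponding to the $r$th set, it follows that the distribution of $\beta$ is (see 2b.1)

$$\text{const. } e^{-\sigma^2\beta^2/2\Sigma^2}\, \beta^{n-2}\, d\beta$$

The joint distribution of $R$ and $\beta$ is

$$P(R, \beta) = P(\beta)P(R \mid \beta)$$

$$= \text{const. } e^{-\sigma^2\beta^2/2\Sigma^2}\, \beta^{n-2}\, d\beta\; e^{-\beta^2/2}\, R^{p-1}(1 - R^2)^{(n-p-3)/2}\; {}_1F_1\left(\frac{n-1}{2}, \frac{p}{2}, \frac{\beta^2R^2}{2}\right) dR$$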




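Expanding ${}_1F_1$ and integrating term by term for $\beta$, we obtain the unconditional distribution of $R$.

$$\text{const. } R^{p-1}(1 - R^2)^{(n-p-3)/2}\, dR \sum_{s=0}^{\infty} \frac{\Gamma\left(\frac{n-1}{2} + s\right)}{s!\,\Gamma\left(\frac{p}{2} + s\right)}\, R^{2s} \int_0^{\infty} e^{-[(\sigma^2 + \Sigma^2)/(2\Sigma^2)]\beta^2} \left(\frac{\beta^2}{2}\right)^{(n-3)/2 + s} d\beta^2$$

$$= \text{const. } R^{p-1}(1 - R^2)^{(n-p-3)/2} \sum_{s=0}^{\infty} \frac{\Gamma^2\left(\frac{n-1}{2} + s\right)}{s!\,\Gamma\left(\frac{p}{2} + s\right)} \left(\frac{\Sigma^2}{\sigma^2 + \Sigma^2}\, R^2\right)^s dR$$

$$= \frac{2(1 - \rho^2)^{(n-1)/2}}{B\left(\frac{p}{2}, \frac{n-p-1}{2}\right)}\, R^{p-1}(1 - R^2)^{(n-p-3)/2}\, F\left(\frac{n-1}{2}, \frac{n-1}{2}, \frac{p}{2}, \rho^2R^2\right) dR \tag{2d.2.1}$$

where $\rho^2 = \Sigma^2/(\sigma^2 + \Sigma^2)$, the ratio of variance due to regression to total, is the measure of multiple correlation in the population.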

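Wishart's Distribution. The problem is to find the joint distribution of the corrected sum of squares and products arising out of $n$ sets of observations from a $k$-variate normal population. If

$$\begin{matrix} x_{11} & \cdots & x_{k1} \\ \vdots & & \vdots \\ x_{1n} & \cdots & x_{kn} \end{matrix}$$

represent the observations, their probability density is

$$\text{const. } e^{-\phi/2}$$

where

$$\phi = \sum\sum \lambda^{ij} \sum_r (x_{ir} - \mu_i)(x_{jr} - \mu_j) = \phi_1 + \phi_2$$
$$\phi_1 = \sum\sum \lambda^{ij}\, n(\bar{x}_i - \mu_i)(\bar{x}_j - \mu_j)$$
$$\phi_2 = \sum\sum \lambda^{ij}S_{ij}$$

$S_{ij}$ being the corrected sum of products. The joint distribution of $\bar{x}_i$ and $S_{ij}$ is the product of

$$\text{const. } e^{-(\phi_1 + \phi_2)/2} \tag{2d.2.2}$$

and

$$\int dx_{11} \cdots dx_{kn} \tag{2d.2.3}$$

taken over all fixed $\bar{x}_i$ and $S_{ij}$. The value of (2d.2.3) is equal to

$$\text{const.} \int_{\bar{x}_1, S_{11}} dx_{11} \cdots dx_{1n} \int_{\bar{x}_2, S_{21}, S_{22}} dx_{21} \cdots dx_{2n} \cdots$$

which on repeated applications of the result in example 10 of 2d.1 reduces to

$$\begin{aligned} \text{const. } & S_{11}^{(n-3)/2}\, d\bar{x}_1\, dS_{11} \\ \times & \left\{\frac{|S_{ij}|_2}{|S_{ij}|_1}\right\}^{(n-1-3)/2} \frac{1}{|S_{ij}|_1^{1/2}}\, d\bar{x}_2\, dS_{21}\, dS_{22} \\ \times & \left\{\frac{|S_{ij}|_3}{|S_{ij}|_2}\right\}^{(n-2-3)/2} \frac{1}{|S_{ij}|_2^{1/2}}\, d\bar{x}_3\, dS_{31}\, dS_{32}\, dS_{33} \\ & \cdots \\ \times & \left\{\frac{|S_{ij}|_k}{|S_{ij}|_{k-1}}\right\}^{[n-(k-1)-3]/2} \frac{1}{|S_{ij}|_{k-1}^{1/2}}\, d\bar{x}_k\, dS_{k1} \cdots dS_{kk} \end{aligned}$$

$$= \text{const. } [|S_{ij}|_k]^{(n-k-2)/2}\, d\bar{x}_1 \cdots d\bar{x}_k\, dS_{11} \cdots dS_{kk}$$

since all the other terms cancel out. This in conjunction with (2d.2.2) gives the distribution of $\bar{x}_1, \ldots, \bar{x}_k$

$$\text{const. } e^{-\frac{n}{2}\sum\sum \lambda^{ij}(\bar{x}_i - \mu_i)(\bar{x}_j - \mu_j)}\, d\bar{x}_1 \cdots d\bar{x}_k$$

and that of $S_{ij}$

$$\text{const. } e^{-\frac{1}{2}\sum\sum \lambda^{ij}S_{ij}}\, |S_{ij}|^{(n-k-2)/2}\, dS_{11} \cdots dS_{kk}$$

This is known as Wishart's distribution with $(n - 1)$ degrees of freedom.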

Example 1. For a bivariate normal distribution the joint distribution of corrected sums of squares $S_1^2$, $S_2^2$ and the correlation coefficient $r$ is


a? 
1 S? BerSaSs + F) g a-g —2(1 c= 72) 12 dSy dS ae 


i aioa 


Example 2. If $\rho = 0$, the statistic

$$t = \frac{r}{\sqrt{1 - r^2}}\sqrt{n - 2}$$

is distributed as Student's $t$ with $(n - 2)$ degrees of freedom.
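
The statistic of example 2 is simple to compute. A minimal sketch in plain Python follows; the function name `t_from_r` and the input checks are illustrative additions, not part of the text.

```python
import math

def t_from_r(r, n):
    """Student's t with n - 2 degrees of freedom computed from a correlation r."""
    if n <= 2:
        raise ValueError("n must exceed 2")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

t_value = t_from_r(0.6, 18)   # r = 0.6 from n = 18 pairs
```

For instance, $r = 0.6$ with $n = 18$ gives $t = 0.6 \times 4 / 0.8 = 3$ on 16 degrees of freedom.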
Example 3. If $\rho \neq 0$, on making the transformation


and integrating for ¢ and z, the distribution of r becomes 


$$\text{const. } (1 - r^2)^{(n-4)/2}\, \frac{d^{\,n-2}}{d(r\rho)^{n-2}} \left\{\frac{\cos^{-1}(-\rho r)}{\sqrt{1 - \rho^2r^2}}\right\} dr$$

Example 4. The constant in Wishart's distribution in the special case when $(\lambda^{ij})$ is a unit matrix is

$$\frac{\left(\frac{1}{2}\right)^{p(n-1)/2}}{\pi^{p(p-1)/4} \prod_{i=1}^{p} \Gamma\left(\frac{n-i}{2}\right)}$$

(Hint: Retain the constant given in example 9 of 2d.1 in evaluating the successive integrals leading to Wishart's distribution.)

Example 5. For any $(\lambda^{ij})$, by making a suitable transformation the constant is found to be $|\lambda^{ij}|^{(n-1)/2}$ times the value in example 4.
Example 6. If $S_{ij}'$ and $S_{ij}''$ are the corrected sums of products in two independent samples of sizes $n_1$ and $n_2$, then

$$S_{ij} = S_{ij}' + S_{ij}''$$

follows Wishart's distribution with $(n_1 + n_2 - 2)$ degrees of freedom.
Example 7. Show that the distribution of the correlation coefficient of a fixed set of $n$ quantities with a random set of $n$ independent observations from a normal population is const. $(1 - r^2)^{(n-4)/2}\, dr$.

If $(f_1, f_2, \ldots, f_n)$ is the fixed vector, define $p_i = (f_i - \bar{f})/\sqrt{\sum (f_i - \bar{f})^2}$, $(i = 1, \ldots, n)$. Make the orthogonal transformation

$$y_1 = \frac{x_1 + \cdots + x_n}{\sqrt{n}}$$
$$y_2 = p_1x_1 + \cdots + p_nx_n$$

and the rest being suitably chosen. It is seen that the required statistic is $y_2/\sqrt{y_2^2 + \cdots + y_n^2}$ where $y_2, \ldots, y_n$ are all independently distributed. The problem is further reduced to determining the distribution of $y_2/\sqrt{y_2^2 + \chi^2}$ where $y_2$ is a normal variate and $\chi^2$ is independently distributed with $(n - 2)$ degrees of freedom.

Example 8. Find the distribution of

$$\sum f_ix_i$$

where $f_1, f_2, \ldots$ and $x_1, x_2, \ldots$ are as defined in example 7.

Example 9. If x and y are independently distributed stochastic 
variables and at least one has a normal distribution, then the distribu- 
tion of the correlation coefficient is the same as in example 3 with p = 0. 
(Hint: The distribution in example 7 is independent of the fixed vector.) 

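Partial Correlation Coefficient. Let the variates $x_1, x_2, \ldots, x_{p+1}$ be such that, given $x_1, \ldots, x_{p-1}$,

$$E(x_p) = \alpha_1 + \beta_{1p}x_1 + \cdots + \beta_{(p-1)p}x_{p-1}$$
$$E(x_{p+1}) = \alpha_2 + \beta_{1(p+1)}x_1 + \cdots + \beta_{(p-1)(p+1)}x_{p-1}$$

The correlation between $x_p$ and $x_{p+1}$ for a given set of values of $x_1, \ldots, x_{p-1}$ is called the partial correlation between $x_p$ and $x_{p+1}$. The partial correlation coefficient is estimated by correlating the residual pairs

$$x_p - a_1 - b_{1p}x_1 - \cdots$$

and

$$x_{p+1} - a_2 - b_{1(p+1)}x_1 - \cdots$$

where $a$ and $b$ stand for the values of $\alpha$ and $\beta$ which minimize the residual sum of squares.

The vector $\mathbf{x}_p$ containing $n$ values can be represented by

$$\mathbf{x}_p = a_1\mathbf{i} + b_{1p}\mathbf{x}_1 + \cdots + b_{(p-1)p}\mathbf{x}_{p-1} + e_{p1}\gamma_1 + \cdots + e_{p(n-p)}\gamma_{n-p}$$

where $\mathbf{i} = (1, 1, \ldots, 1)$ and $\gamma_1, \ldots, \gamma_{n-p}$ are mutually orthogonal vectors orthogonal to $\mathbf{i}$ and $\mathbf{x}_i$, and $e_{p1}, e_{p2}, \ldots$ are suitably determined. $\mathbf{x}_{p+1}$ has a similar representation from which the following quantities can be constructed.

$$S'_{pp} = e_{p1}^2 + \cdots + e_{p(n-p)}^2$$
$$S'_{(p+1)(p+1)} = e_{(p+1)1}^2 + \cdots + e_{(p+1)(n-p)}^2$$
$$S'_{p(p+1)} = e_{p1}e_{(p+1)1} + \cdots + e_{p(n-p)}e_{(p+1)(n-p)}$$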




which supply the residual sum of squares and products. The estimated partial correlation coefficient is

$$r' = \frac{S'_{p(p+1)}}{\sqrt{S'_{pp}\,S'_{(p+1)(p+1)}}}$$
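
The estimate obtained by correlating residual pairs can be computed directly. The sketch below assumes NumPy; `partial_corr_by_residuals` is a hypothetical helper and the data are simulated. As a cross-check it uses the known identity expressing the partial correlation through the inverse of the matrix of corrected sums of products, which is not derived in the text.

```python
import numpy as np

def partial_corr_by_residuals(X, i, j, given):
    """Correlate residuals of columns i and j after regression on the given columns."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X[:, given]])      # constant term plus x_1..x_{p-1}
    ei = X[:, i] - Z @ np.linalg.lstsq(Z, X[:, i], rcond=None)[0]
    ej = X[:, j] - Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    return float(ei @ ej / np.sqrt((ei @ ei) * (ej @ ej)))

rng = np.random.default_rng(7)
n = 30
X = rng.normal(size=(n, 4))
X[:, 2] += 0.8 * X[:, 0]                                # make the columns dependent
X[:, 3] += 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.4 * X[:, 2]
r_partial = partial_corr_by_residuals(X, 2, 3, [0, 1])

Xc = X - X.mean(axis=0)
P = np.linalg.inv(Xc.T @ Xc)                            # reciprocal of (S_ij)
r_check = -P[2, 3] / np.sqrt(P[2, 2] * P[3, 3])
```

The residual-based value `r_partial` coincides with `r_check`, so either route may be used to obtain $r'$.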
It is easy to verify that

$$E(e_{pi}) = 0, \qquad V(e_{pi}) = V(x_p) = \sigma_1^2$$
$$E(e_{(p+1)i}) = 0, \qquad V(e_{(p+1)i}) = V(x_{p+1}) = \sigma_2^2$$
$$\operatorname{cov}(e_{pi}\, e_{(p+1)i}) = \rho\sigma_1\sigma_2$$

where $\rho$ is the partial correlation coefficient.

If $x_p$ and $x_{p+1}$ for a given $x_1, \ldots, x_{p-1}$ are normally distributed, then $e_{pi}$ and $e_{(p+1)i}$ can be regarded as observations from a bivariate population with correlation $\rho$, and the distribution of $r'$ is the same as that derived above with $(n - 1)$ replaced by $(n - p) = (n - 1) - (p - 1)$, corresponding to $(p - 1)$ eliminated variables.

Distribution of T, D², etc. Consider the following $n$ sets of observations:

$$\begin{matrix} x_{11} & \cdots & x_{p1} \\ x_{12} & \cdots & x_{p2} \\ \vdots & & \vdots \\ x_{1n} & \cdots & x_{pn} \end{matrix}$$

having the joint distribution

$$\text{const. } e^{-\frac{1}{2}\sum_r [(x_{1r} - \mu)^2 + x_{2r}^2 + \cdots + x_{pr}^2]}\, dx_{11} \cdots dx_{pn}$$
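Defining

$$R_s = \frac{|S_{ij} + n\bar{x}_i\bar{x}_j|_{s-1}}{|S_{ij}|_{s-1}} \div \frac{|S_{ij} + n\bar{x}_i\bar{x}_j|_s}{|S_{ij}|_s}$$

we find from example 5 in 2d.1 that the distribution of $R_s$ for $s \neq 1$ is

$$B\left(\frac{n-s}{2}, \frac{1}{2}\right) dR_s$$

Considering the variables in the order $x_p, \ldots, x_2$, the joint distribution of $R_p, R_{p-1}, \ldots, R_2$ is

$$B\left(\frac{n-2}{2}, \frac{1}{2}\right) dR_2\; B\left(\frac{n-3}{2}, \frac{1}{2}\right) dR_3 \cdots B\left(\frac{n-p}{2}, \frac{1}{2}\right) dR_p$$

and that of $R_1$ is as shown in (2b.2.4).

$$e^{-n\mu^2/2} \sum_{r=0}^{\infty} \frac{1}{r!}\left(\frac{n\mu^2}{2}\right)^r B\left(\frac{n-1}{2}, \frac{1}{2} + r\right) dR_1$$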


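The joint distribution of $R_1, R_2, \ldots, R_p$ is

$$e^{-n\mu^2/2} \sum_{r=0}^{\infty} \frac{1}{r!}\left(\frac{n\mu^2}{2}\right)^r B\left(\frac{n-1}{2}, \frac{1}{2} + r\right) dR_1\; B\left(\frac{n-2}{2}, \frac{1}{2}\right) dR_2 \cdots B\left(\frac{n-p}{2}, \frac{1}{2}\right) dR_p$$

Using the result (2a.9.8) for each term, we find the distribution of

$$S_p = R_1R_2 \cdots R_p = \frac{|S_{ij}|_p}{|S_{ij} + n\bar{x}_i\bar{x}_j|_p}$$

is

$$e^{-n\mu^2/2} \sum_{r=0}^{\infty} \frac{1}{r!}\left(\frac{n\mu^2}{2}\right)^r B\left(\frac{n-p}{2}, \frac{p}{2} + r\right) dS_p$$

$$= \text{const. } S_p^{(n-p-2)/2}(1 - S_p)^{(p-2)/2}\; {}_1F_1\left(\frac{n}{2}, \frac{p}{2}, \frac{n\mu^2(1 - S_p)}{2}\right) dS_p \tag{2d.2.4}$$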


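Similarly, the distribution of

$$R = \frac{|S_{ij} + n\bar{x}_i\bar{x}_j|_t}{|S_{ij}|_t} \div \frac{|S_{ij} + n\bar{x}_i\bar{x}_j|_p}{|S_{ij}|_p}$$

is

$$B\left(\frac{n-p}{2}, \frac{p-t}{2}\right) dR \tag{2d.2.5}$$

and is independent of $S_t$ since $R_1, \ldots, R_p$ are independent.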
In proving Wishart's distribution it was shown that the joint distribution of the means $\bar{x}_1, \ldots, \bar{x}_p$ and the corrected sum of products, $S_{ij}$, is

$$\text{const. } e^{-\frac{1}{2}[n(\bar{x}_1 - \mu)^2 + n\bar{x}_2^2 + \cdots + n\bar{x}_p^2 + \sum S_{ii}]}\, |S_{ij}|^{(n-1)/2 - (p+1)/2}\, \prod d\bar{x}_i\, \prod dS_{ij}$$

The joint distribution of $S_t$ and $R$ could be directly derived from the above expression. Hence we obtain the following lemma.
Lemma. If the variables 21, ++» Zk and cij, (j = 1, 

probability density 


-, k), have the 


— El +e „. |a12—(k+1)/2 
const:e yrl? cii) +j cij |2 


then the statistic 
| Cij le 

a Re Ne 

| cij + 2:23 le 


S:= 


has the probability density 
TN (eae! T 242 
afat (5) B a t 4+) dS: (24.2.6) 
r!\2 2 2 




and 
Ba | cij + zz; le ZERAN 

| ez |e | ci; le 
has the probability density independent of S; . 

—k+i1k-t 

B (=Ħ, a ) at (2d.2.7) 
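Example 1. Show that the statistic

$$S_p = \frac{|S_{ij}|_p}{|S_{ij} + n\bar{x}_i\bar{x}_j|_p}$$

is invariant under linear transformations of the variables $x_1, \ldots, x_p$. (Hint: If $L$ is the transformation matrix, the determinant $|S_{ij}|$ of the transformed variables is $|L|\,|S_{ij}|\,|L|$.)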
Applications of the Lemma. (i) Consider a sample of size n from a 
p-variate normal distribution 


const. e7 E- MAG hy)’ dx 


To find the distribution of S, we first make a linear transformation 
x = yL such that 


(= BAG = p ~ = u)? + ye? +e ty? 


where u = Ap’. Since Sp is invariant under lincar transformations, it 
has the distribution (2d.2.4) with L = pAp’. Suppose that the mean 
value of every linear function 


M421 + azto +--+- anty 
uncorrelated with x1, x2, ++ 


*, t is zero; then we can make a lincar 
transformation of the type 


Yo = 4% +--+ aut 


Ye = ant Hees ay ty 


Year = Quiti +++ + aaqryptp 


Yp = Api, s+ ppt 


such that y1, ---, yp are uncorrelated. Since S: and R are invariant 
under the above transformation, it follows that R has the distribution 
(2d.2.5) under the assumption that any linear function uncorrelated 
with 21, ---, 2 has zero mean value. 

(ii) Suppose that two samples of sizes $n_1$ and $n_2$ are available from two $p$-variate normal populations having the same dispersion matrix. Let

$d_i = \bar{x}_{i1} - \bar{x}_{i2}$ = Difference in mean values for the $i$th variable in the sample.

$\delta_i = \mu_{i1} - \mu_{i2}$ = Difference in mean values for the $i$th variable in the population.

The dispersion matrix of $d$ is $1/c = [(1/n_1) + (1/n_2)]$ times that of $x$, so that the probability density of $d$ is

$$\text{const. } e^{-\frac{c}{2}(d - \delta)\Lambda(d - \delta)'}$$

The $p(p + 1)/2$ quantities

$$S_{ij} = S_{ij}' + S_{ij}''$$

where $S_{ij}'$ and $S_{ij}''$ are the corrected sums of products in the first and second samples, are distributed in Wishart's distribution with $(n_1 - 1) + (n_2 - 1)$ degrees of freedom. Hence the statistic

$$S_p = \frac{|S_{ij}|_p}{|S_{ij} + c\,d_id_j|_p}$$
is distributed as in (2d.2.6) with g = (m + n2 — 2J, t=p f = c AS’. 

If all linear functions uncorrelated with $x_1, \ldots, x_t$ have no difference in mean values between the two populations, then, by making a transformation similar to that used in (i) above, it can be shown that

$$R = \frac{|S_{ij} + c\,d_id_j|_t}{|S_{ij}|_t} \div \frac{|S_{ij} + c\,d_id_j|_p}{|S_{ij}|_p}$$

is distributed as in (2d.2.7) with g = (nı +n — 2), k =p,t=t, 

The distribution of Mahalanobis’ D“, connected with Sp by the rela- 


tion 


Sp 


1 


Ss = —————— Dy = prs"d,d; 
P c D 5 
1 
© m + ng — 2 P 


where (s%) is the matrix reciprocal to (si) = (Szj/(m + ne — 2)) or 
that of Hotelling’s T defined by 
Sp 


1 
EETA 


can be deduced from that of S. When 6; = 59 =- -= 0 the distribu- 


tion of Sp is 
, my tte Pat 2) ds 
B 2 3 2 p 


or that of $(1 - S_p)/S_p = T_p$ is

$$\text{const. } \frac{T_p^{(p-2)/2}}{(1 + T_p)^{(n_1 + n_2 - 1)/2}}\, dT_p$$

which means that

$$\frac{n_1 + n_2 - p - 1}{p}\, T_p \quad \text{or} \quad \frac{n_1 + n_2 - p - 1}{p} \cdot \frac{c}{n_1 + n_2 - 2}\, D_p^2$$

is distributed as a variance ratio with $p$ and $(n_1 + n_2 - p - 1)$ degrees of freedom. Similarly, when the conditions under which the distribution of $R$ is derived hold, the statistic

$$\frac{n_1 + n_2 - p - 1}{p - t}\left(\frac{1}{R} - 1\right)$$

is distributed as the variance ratio with $(p - t)$ and $(n_1 + n_2 - p - 1)$ degrees of freedom.
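
For a concrete view of how these statistics are connected, the following sketch assumes NumPy and computes $S_p$, $T_p = (1 - S_p)/S_p$, Mahalanobis' $D_p^2$ (with $s_{ij} = S_{ij}/(n_1 + n_2 - 2)$) and the variance ratio from two simulated samples. The function names and the data are hypothetical; the check $T_p = cD_p^2/(n_1 + n_2 - 2)$ follows from the determinant identity $|S + c\,d'd| = |S|(1 + c\,dS^{-1}d')$.

```python
import numpy as np

def corrected_sp(X):
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc

def two_sample_statistics(X1, X2):
    n1, p = X1.shape
    n2 = X2.shape[0]
    f = n1 + n2 - 2                               # degrees of freedom of S_ij
    d = X1.mean(axis=0) - X2.mean(axis=0)         # differences in sample means
    S = corrected_sp(X1) + corrected_sp(X2)       # pooled corrected sums of products
    c = n1 * n2 / (n1 + n2)                       # 1/c = 1/n1 + 1/n2
    D2 = f * (d @ np.linalg.solve(S, d))          # Mahalanobis D^2
    Sp = np.linalg.det(S) / np.linalg.det(S + c * np.outer(d, d))
    Tp = (1.0 - Sp) / Sp
    F = (f - p + 1) / p * Tp                      # n1 + n2 - p - 1 = f - p + 1
    return {"D2": D2, "Sp": Sp, "Tp": Tp, "F": F, "df": (p, f - p + 1)}

rng = np.random.default_rng(8)
X1 = rng.normal(size=(12, 3)) + np.array([0.5, 0.0, 0.0])
X2 = rng.normal(size=(15, 3))
out = two_sample_statistics(X1, X2)
L = rng.normal(size=(3, 3)) + 3 * np.eye(3)       # a nonsingular linear transformation
out_L = two_sample_statistics(X1 @ L, X2 @ L)
```

The value of `Sp`, and hence of the variance ratio, is unchanged by the linear transformation `L`, in line with the invariance used in the applications above.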


References 


Bose, R. C., and S. N. Roy (1938). The distribution of Studentised D²-statistic. Sankhyā, 4, 337.
Cochran, W. G. (1935). The distribution of quadratic forms in a normal system, with applications to the analysis of covariance. Proc. Camb. Phil. Soc., 30, 178.
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biom., 10, 507.
Fisher, R. A. (1924). The distribution of the partial correlation coefficient. Metron., 3, 329.
Fisher, R. A. (1928). The general sampling distribution of the multiple correlation coefficient. Proc. Roy. Soc. A, 121, 654.
Hotelling, H. (1931). The generalization of 'Student's' ratio. Ann. Math. Stats., 2, 360.
Hsu, P. L. (1938). Notes on Hotelling's generalized T. Ann. Math. Stats., 9, 231.
Nair, U. S. (1949). Allahabad Science Congress, Presidential Address.
Nandi, H. K. (1949). A note on conditional tests of significance. Bull. Cal. Math. Soc., 41, 121.
Pearson, K. (1934). Editor, Tables of the incomplete gamma function. Cambridge University Press.
Pearson, K. (1934). Editor, Tables of the incomplete beta function. Cambridge University Press.
Rao, C. R. (1946). Tests with discriminant functions in multivariate analysis. Sankhyā, 7, 407.
Rao, C. R. (1949). On some problems arising out of discrimination with multiple characters. Sankhyā, 9, 343.
Romanovsky, V. (1925). On the moments of the hypergeometric series. Biom., 17, 57.
Wishart, J. (1928). The generalised product moment distribution in samples from a normal multi-variate population. Biom., 20A, 32.


CHAPTER 3


The Theory of Linear Estimation 


and Tests of Hypotheses 


3a Linear Estimation 


3a.1 Observational Equations

Let $y_1, \ldots, y_n$ be $n$ independent stochastic variables with a common unknown variance $\sigma^2$ and having as expectations linear functions of $k$ unknown parameters $\tau_1, \ldots, \tau_k$, the expectation of $y_i$ being

$$E(y_i) = a_{i1}\tau_1 + \cdots + a_{ik}\tau_k, \qquad i = 1, \ldots, n \tag{3a.1.1}$$

where the compounding coefficients $a_{ij}$ are known. Let the rank of the matrix $(a_{ij})$ or the number of independent linear functions on the right side of the equations (3a.1.1) be $r$. The quantities $n$ giving the number of observations, $k$ the number of unknown parameters and $r$ the rank of the matrix $(a_{ij})$ can be quite general and need not satisfy any equality or inequality relationships. Equations such as (3a.1.1) are called the observational equations. Nothing need be assumed at this stage about the actual distribution of the stochastic variables.


3a.2 Best Unbiased Estimates

A linear function $p_1\tau_1 + \cdots + p_k\tau_k$ where the $p$ coefficients are known is called a linear parametric function. This parametric function is said to be estimable if there exists a linear function $b_1y_1 + \cdots + b_ny_n$ of the observations such that

$$E(b_1y_1 + \cdots + b_ny_n) = p_1\tau_1 + \cdots + p_k\tau_k$$

A function $b_1y_1 + \cdots + b_ny_n$ satisfying the above condition is called an unbiased estimate of $p_1\tau_1 + \cdots + p_k\tau_k$. If no such linear function exists, the parametric function is said to be non-estimable.

An unbiased estimate with the minimum possible variance is said to be the best unbiased estimate. The mathematical discussion of arriving at an estimate with the minimum possible variance out of a large class of unbiased estimates is known as the theory of linear estimation, which was originally considered by Gauss and later explicitly formulated by Markoff (1912).


3a.3 The Necessary and Sufficient Condition for the 
Existence of an Unbiased Estimate 


If $b_1y_1 + \cdots + b_ny_n$ is an unbiased estimate of $p_1\tau_1 + \cdots + p_k\tau_k$, then

$$E(b_1y_1 + \cdots + b_ny_n) = p_1\tau_1 + \cdots + p_k\tau_k \tag{3a.3.1}$$

Using the expected value of $y$ from the equations (3a.1.1) and equating the coefficients of $\tau$ on both sides, the following equations are obtained:

$$p_i = b_1a_{1i} + \cdots + b_na_{ni}, \qquad i = 1, \ldots, k \tag{3a.3.2}$$

This means that the set of equations (3a.3.2) treating $b$ as unknown is soluble. Also, if the equations (3a.3.2) are soluble, then (3a.3.1) holds. The necessary and sufficient condition for the estimability of the parametric function $p_1\tau_1 + \cdots + p_k\tau_k$ is that the set of equations (3a.3.2) is soluble, the condition for which is given in 1a.5. The vector $p = (p_1, \ldots, p_k)$ must depend on the row vectors in $(a_{ij})$.
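
The estimability condition amounts to a rank comparison, which may be sketched as follows. The code assumes NumPy; the function `is_estimable` and the small two-group design (columns: general mean and two group effects) are illustrative assumptions, not an example from the text.

```python
import numpy as np

def is_estimable(A, p):
    """p is estimable when it depends on the row vectors of A,
    i.e. when appending p as an extra row leaves the rank unchanged."""
    rank_A = np.linalg.matrix_rank(A)
    return bool(np.linalg.matrix_rank(np.vstack([A, p])) == rank_A)

# Two groups of two observations; parameters (general mean, effect 1, effect 2).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])

contrast = np.array([0.0, 1.0, -1.0])     # difference of the two effects
single = np.array([0.0, 1.0, 0.0])        # one effect on its own
group_mean = np.array([1.0, 1.0, 0.0])    # general mean plus first effect
```

In this design $r = 2 < k = 3$: the contrast and the group mean are estimable, while a single effect on its own is not.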


3a.4 Normal Equations 


If the solution of (3a.3.2) is unique, then there is only one unbiased estimate and that is the best possible. In general there will be a multiplicity of solutions, and the one for which the variance is the least has to be chosen. The variance of $b_1y_1 + \cdots + b_ny_n$ is $(b_1^2 + \cdots + b_n^2)\sigma^2$. The problem, then, reduces to minimizing $(b_1^2 + \cdots + b_n^2)$ subject to the conditions (3a.3.2). To obtain the restricted minimum we need to consider the expression

$$\sum_{j=1}^{n} b_j^2 + 2\sum_{i=1}^{k} \lambda_i(p_i - b_1a_{1i} - \cdots - b_na_{ni})$$

where $\lambda_1, \ldots, \lambda_k$ are Lagrangian multipliers and to differentiate with respect to $b_1, \ldots, b_n$. The minimizing equations are

$$b_j = \lambda_1a_{j1} + \cdots + \lambda_ka_{jk}, \qquad j = 1, \ldots, n \tag{3a.4.1}$$

On eliminating $b$ in (3a.3.2) with the use of (3a.4.1), the equations giving $\lambda$ are obtained

$$p_i = \lambda_1(\alpha_i \cdot \alpha_1) + \cdots + \lambda_k(\alpha_i \cdot \alpha_k), \qquad i = 1, \ldots, k \tag{3a.4.2}$$

where $\alpha_i$ is the vector $(a_{1i}, a_{2i}, \ldots, a_{ni})$ and $\alpha_i \cdot \alpha_j$ is the vector product $a_{1i}a_{1j} + \cdots + a_{ni}a_{nj}$. It is enough to get a single set of $(\lambda_1, \ldots, \lambda_k)$ satisfying (3a.4.2) for substitution in (3a.4.1) to obtain the $b$ coefficients, since these are unique * so long as $\lambda_1, \ldots, \lambda_k$ satisfy (3a.4.2). The best estimate is

$$b_1y_1 + \cdots + b_ny_n = \lambda_1(\alpha_1 \cdot y) + \cdots + \lambda_k(\alpha_k \cdot y) \tag{3a.4.3}$$

where $y = (y_1, \ldots, y_n)$ and $\alpha_i \cdot y = a_{1i}y_1 + \cdots + a_{ni}y_n$. The condition of unbiasedness gives

$$\lambda_1E(\alpha_1 \cdot y) + \cdots + \lambda_kE(\alpha_k \cdot y) = p_1\tau_1 + \cdots + p_k\tau_k$$

This shows that if the observational equations (3a.1.1) are replaced by

$$E(\alpha_i \cdot y) = E(Q_i) = (\alpha_i \cdot \alpha_1)\tau_1 + \cdots + (\alpha_i \cdot \alpha_k)\tau_k, \qquad i = 1, \ldots, k \tag{3a.4.4}$$

then any linear function of $Q_1, \ldots, Q_k$, unbiased for a parametric function $p_1\tau_1 + \cdots + p_k\tau_k$, is unique as a function of $y_1, \ldots, y_n$ and is also the best estimate. The equations (3a.4.4) so constructed are called the normal equations.


3a.5 Linear Functions with Zero Expectations 
If $c_1y_1 + \cdots + c_ny_n$ is a linear function whose expectation is zero,
then

$$c_1a_{1j} + \cdots + c_na_{nj} = 0, \qquad j = 1, \ldots, k \tag{3a.5.1}$$

Since the rank of $(a_{ij})$ is $r$, there are $(n - r)$ independent sets $(c_1, \ldots, c_n)$
which satisfy the equations (3a.5.1) (see 1a.4), which shows that
there are $(n - r)$ independent linear functions of the variables whose
expectations are identically zero.

If $b_1y_1 + \cdots + b_ny_n$ is the best estimate of a parametric function, then

$$b_i = \sum_{j=1}^{k} \lambda_ja_{ij}$$

$$\sum_i b_ic_i = \sum_j \lambda_j \sum_i c_ia_{ij} = 0 \tag{3a.5.2}$$

Equation (3a.5.2) shows that the best estimates of parametric functions
are uncorrelated with linear functions whose expectations are zero.

The number of independent parametric functions that can be estimated
is $r$, the rank of the matrix $(a_{ij})$, and their best estimates are uncorrelated
with the $(n - r)$ linear functions whose expectations are zero.


* If $\lambda'_1, \ldots, \lambda'_k$ is another solution leading to $b'_1, \ldots, b'_n$, then, defining
$\lambda_i - \lambda'_i = d_i$ and $b_i - b'_i = c_i$, we have

$$0 = (\alpha_1\cdot\alpha_i)d_1 + \cdots + (\alpha_k\cdot\alpha_i)d_k, \qquad i = 1, \ldots, k$$

$$c_j = d_1a_{j1} + \cdots + d_ka_{jk}$$

from which it follows that $\sum c_j^2 = 0$. Hence $c_1 = c_2 = \cdots = c_n = 0$.


3a.6 Standard Errors of Estimates and Intrinsic Properties of 
Normal Equations 


Since $Q_i = (\alpha_i\cdot y)$, it follows that

$$V(Q_i) = (\alpha_i\cdot\alpha_i)\sigma^2, \qquad \operatorname{cov}(Q_i, Q_j) = (\alpha_i\cdot\alpha_j)\sigma^2 \tag{3a.6.1}$$

Hence the variance of $Q_i$ is $\sigma^2$ times the coefficient of $\tau_i$ in the $i$th normal
equation, and the covariance of $Q_i$, $Q_j$ is $\sigma^2$ times the coefficient of $\tau_i$
in the $j$th normal equation or of $\tau_j$ in the $i$th normal equation. If the
best estimate of $p_1\tau_1 + \cdots + p_k\tau_k$ is given by $l_1Q_1 + \cdots + l_kQ_k$, then

$$p_i = (\alpha_1\cdot\alpha_i)l_1 + \cdots + (\alpha_k\cdot\alpha_i)l_k, \qquad i = 1, 2, \ldots, k \tag{3a.6.2}$$

Using (3a.6.1),

$$V(l_1Q_1 + \cdots + l_kQ_k) = \sigma^2 \sum_i\sum_j l_il_j(\alpha_i\cdot\alpha_j) = \sigma^2 \sum_i l_ip_i \tag{3a.6.3}$$

by virtue of (3a.6.2). Similarly the covariance of $\sum l_iQ_i$, $\sum m_iQ_i$, the
best estimates of $\sum p_i\tau_i$ and $\sum q_i\tau_i$, is given by

$$\sigma^2 \sum l_iq_i = \sigma^2 \sum m_ip_i \tag{3a.6.4}$$

The formulae (3a.6.3) and (3a.6.4) supply an easy method of evaluating
the variances and covariances of the best estimates. Only the
compounding coefficients of $Q_1, \ldots, Q_k$ need be determined for the
application of these formulae.
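As an illustration of formula (3a.6.3), the sketch below uses a hypothetical design with two parameters (a constant column and a column 0, 1, 2, 3; neither appears in the text) and checks that the variance factor obtained directly as the sum of squared b coefficients agrees with the sum of l_i p_i. Exact fractions are used so that the two values can be compared for equality.

```python
from fractions import Fraction as F

# Hypothetical design with k = 2 parameters:
# alpha_1 = (1, 1, 1, 1) and alpha_2 = (0, 1, 2, 3).
cols = [[F(1)] * 4, [F(0), F(1), F(2), F(3)]]
p = [F(0), F(1)]          # parametric function p.tau = tau_2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Coefficients (alpha_j . alpha_i) of the normal equations.
G = [[dot(ci, cj) for cj in cols] for ci in cols]

# Solve the 2 x 2 system (3a.6.2) for l by Cramer's rule.
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
l = [(p[0] * G[1][1] - G[0][1] * p[1]) / det,
     (G[0][0] * p[1] - G[1][0] * p[0]) / det]

# The estimate is sum_j b_j y_j with b_j = l_1 a_j1 + l_2 a_j2.
b = [l[0] * cols[0][j] + l[1] * cols[1][j] for j in range(4)]

var_direct = dot(b, b)     # sum of b_j^2, in units of sigma^2
var_formula = dot(l, p)    # sum of l_i p_i, formula (3a.6.3)
print(var_direct, var_formula)
```

Only the compounding coefficients l enter the second computation, which is the point of the formula.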


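If in the first equation the coefficient of $\tau_1$ is reduced to unity,

$$\frac{E(Q_1)}{(\alpha_1\cdot\alpha_1)} = \tau_1 + \frac{(\alpha_2\cdot\alpha_1)}{(\alpha_1\cdot\alpha_1)}\tau_2 + \cdots + \frac{(\alpha_k\cdot\alpha_1)}{(\alpha_1\cdot\alpha_1)}\tau_k$$

and $\tau_1$ is eliminated from the rest of the equations by the method of
sweep out (see 1a.3), then

$$E(Q_i') = E\left\{Q_i - \frac{(\alpha_i\cdot\alpha_1)}{(\alpha_1\cdot\alpha_1)}Q_1\right\} = \sum_{j=2}^{k}\left\{(\alpha_i\cdot\alpha_j) - \frac{(\alpha_i\cdot\alpha_1)(\alpha_j\cdot\alpha_1)}{(\alpha_1\cdot\alpha_1)}\right\}\tau_j, \qquad i = 2, \ldots, k$$

These become normal equations for the estimation of parametric
functions involving $\tau_2, \ldots, \tau_k$. Also

$$V(Q_i') = V\left\{Q_i - \frac{(\alpha_i\cdot\alpha_1)}{(\alpha_1\cdot\alpha_1)}Q_1\right\} = \sigma^2\left\{(\alpha_i\cdot\alpha_i) - \frac{(\alpha_i\cdot\alpha_1)^2}{(\alpha_1\cdot\alpha_1)}\right\}$$

$= \sigma^2 \times$ the coefficient of $\tau_i$ in the equation for $Q_i'$

Similarly

$\operatorname{cov}(Q_i', Q_j') = \sigma^2 \times$ the coefficient of $\tau_j$ in the equation for $Q_i'$

$= \sigma^2 \times$ the coefficient of $\tau_i$ in the equation for $Q_j'$

These properties may be termed “the intrinsic properties” of normal
equations. From this it follows that, if the best estimates of $r_2\tau_2 + \cdots + r_k\tau_k$
and $s_2\tau_2 + \cdots + s_k\tau_k$ are $f_2Q_2' + \cdots + f_kQ_k'$ and $g_2Q_2' + \cdots + g_kQ_k'$, then

$$V(f_2Q_2' + \cdots + f_kQ_k') = (f_2r_2 + \cdots + f_kr_k)\sigma^2 \tag{3a.6.5}$$

$$\operatorname{cov}\left(\sum f_iQ_i', \sum g_iQ_i'\right) = (f_2s_2 + \cdots + f_ks_k)\sigma^2 = (g_2r_2 + \cdots + g_kr_k)\sigma^2 \tag{3a.6.6}$$

These properties hold good when the parameters are successively
eliminated by the method of sweep out.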


3a.7 Principle of Substitution 

Some amount of simplification can be effected in the computational
methods of the foregoing analysis by adopting the principle of
substitution or fitting of constants. Let $t_1, \ldots, t_k$ be a solution of the
equations

$$Q_i = (\alpha_1\cdot\alpha_i)t_1 + \cdots + (\alpha_k\cdot\alpha_i)t_k, \qquad i = 1, 2, \ldots, k \tag{3a.7.1}$$

If an estimable parametric function $p_1\tau_1 + \cdots + p_k\tau_k$ is estimated by
$l_1Q_1 + \cdots + l_kQ_k$, then

$$\sum p_i\tau_i = \sum_i l_i\{(\alpha_1\cdot\alpha_i)\tau_1 + \cdots + (\alpha_k\cdot\alpha_i)\tau_k\} \tag{3a.7.2}$$

Substituting $t$ for $\tau$ in (3a.7.2),

$$\sum p_it_i = \sum_i l_i\{(\alpha_1\cdot\alpha_i)t_1 + \cdots + (\alpha_k\cdot\alpha_i)t_k\} = l_1Q_1 + \cdots + l_kQ_k$$

by virtue of the equations (3a.7.1). This shows that the best estimate
of an estimable parametric function is obtained by substituting for the
parameters $\tau_1, \ldots, \tau_k$ any solution of the equations (3a.7.1). The
solution for $t_i$ will be the best estimate of $\tau_i$ only when $\tau_i$ is estimable.
The equations (3a.7.1) can be obtained formally by minimizing the sum
of squares

$$\sum_i (y_i - a_{i1}t_1 - \cdots - a_{ik}t_k)^2$$

with respect to $t_1, \ldots, t_k$.

If $\sum n_i\tau_i$ is non-estimable and $\sum n_it_i$ is homogeneous in $y$, then
$E(\sum n_it_i) \neq \sum n_i\tau_i$, for otherwise it contradicts the assumption of
non-estimability. Also, the result of substituting a solution in an estimable
parametric function leads to a homogeneous function of the observations.
Hence the necessary and sufficient conditions for $\sum n_it_i$ to be the
best estimate of $\sum n_i\tau_i$ is that (a) $\sum n_it_i$ is homogeneous in $y_1, \ldots, y_n$
and (b) $E(\sum n_it_i) = \sum n_i\tau_i$. Also, if the result of substituting two
different solutions in $\sum n_i\tau_i$ leads to two different values, then $\sum n_i\tau_i$ is
non-estimable. This supplies a sufficient condition for non-estimability.

The set of normal equations are consistent in the sense that there
always exist solutions $t_1, \ldots, t_k$ satisfying them. To prove this it is
sufficient to show that, if there exist quantities $d_1, \ldots, d_k$ such that
$\sum_i d_i(\alpha_i\cdot\alpha_j) = 0$ for all $j$, then $\sum d_iQ_i = 0$. Now

$$V\left(\sum d_iQ_i\right) = \sigma^2 \sum_i d_i \sum_j d_j(\alpha_i\cdot\alpha_j) = 0$$

which shows that the variance of a homogeneous function of stochastic
variables is identically zero. This is not possible unless the compounding
coefficients identically vanish, in which case $\sum d_iQ_i = 0$.

Since a single solution is sufficient for the purpose of substitution, 
we may add a set of consistent, convenient, or conventionally chosen 
equations to the normal equations and solve them. In many practical 
situations the normal equations have unique solutions, in which case 
all parametric functions are estimable. 
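The effect of adding a conventional equation can be seen in a small case. The sketch below is illustrative only: it assumes hypothetical data for two groups under E(y) = mu + alpha_g, for which the normal equations have many solutions. Two solutions are obtained by adding mu = 0 and alpha_1 = 0 respectively; both satisfy the normal equations, both give the same value for the estimable difference alpha_1 - alpha_2, and they disagree for alpha_1 alone.

```python
from fractions import Fraction as F

# Hypothetical data for two groups under E(y) = mu + alpha_g.
g1 = [F(3), F(5)]
g2 = [F(10), F(12), F(14)]
n1, n2 = len(g1), len(g2)
Q = [sum(g1) + sum(g2), sum(g1), sum(g2)]   # Q for (mu, alpha_1, alpha_2)


def satisfies_normal_equations(t):
    mu, a1, a2 = t
    rhs = [(n1 + n2) * mu + n1 * a1 + n2 * a2,
           n1 * mu + n1 * a1,
           n2 * mu + n2 * a2]
    return rhs == Q


m1, m2 = sum(g1) / n1, sum(g2) / n2
sol_a = (F(0), m1, m2)          # added equation: mu = 0
sol_b = (m1, F(0), m2 - m1)     # added equation: alpha_1 = 0


def substitute(p, t):
    """Value of p_1 t_1 + ... + p_k t_k for a solution t."""
    return sum(pi * ti for pi, ti in zip(p, t))


diff = (0, 1, -1)               # alpha_1 - alpha_2, estimable
single = (0, 1, 0)              # alpha_1 alone, non-estimable
print(substitute(diff, sol_a), substitute(diff, sol_b))
print(substitute(single, sol_a), substitute(single, sol_b))
```

The disagreement in the second line of output is an instance of the sufficient condition for non-estimability stated earlier in this section.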

This aspect of normal equations is not properly brought out in litera- 
ture. Unnecessary restrictions * have been imposed on the rank of the 
observational equations to make all the unknown quantities estimable, 


in which case the normal equations have a unique solution. The above 
treatment covers the most general case. 


*This generalization was first noted by R. C. Bose, who developed a special 
method for estimating an assigned parametric function. 


The author (Rao, 1945) 
has shown that even when all the restrictions are withdrawn the least square tech- 
nique of deriving normal equations and substituting the solution in a parametric 


function works. The principle of least squares in estimation and derivation of 
statistical tests is thus valid under very general conditions. 




3a.8 Observational Equations with Linear Restrictions on
Parameters

Sometimes it may be known that the parameters $\tau_1, \ldots, \tau_k$ in the
observational equations (3a.1.1) satisfy some linear restrictions:

$$g_i = r_{i1}\tau_1 + \cdots + r_{ik}\tau_k, \qquad i = 1, \ldots, m \tag{3a.8.1}$$

In this situation two courses are open. It may be possible to eliminate
some of the $\tau$ parameters in the observational equations with the help
of equations (3a.8.1) and obtain a different set of observational
equations with fewer $\tau$ parameters having no restrictions. The theory
developed above will then be applicable. Another method is to derive
the normal equations by minimizing

$$\sum_i (y_i - a_{i1}t_1 - \cdots - a_{ik}t_k)^2$$

subject to the restrictions (3a.8.1). Introducing Lagrangian parameters
$l_1, \ldots, l_m$, the normal equations, $(m + k)$ in number, are

$$Q_i = (\alpha_1\cdot\alpha_i)t_1 + \cdots + (\alpha_k\cdot\alpha_i)t_k + l_1r_{1i} + \cdots + l_mr_{mi}, \qquad i = 1, \ldots, k$$

and

$$g_j = r_{j1}t_1 + \cdots + r_{jk}t_k, \qquad j = 1, \ldots, m \tag{3a.8.2}$$

The best estimate of any estimable parametric function $p_1\tau_1 + \cdots + p_k\tau_k$
is simply $p_1t_1 + \cdots + p_kt_k$, where $t_1, \ldots, t_k$ are chosen to satisfy
equations (3a.8.2).

If the best estimate of $p_1\tau_1 + \cdots + p_k\tau_k$ is obtained as $c_1Q_1 + \cdots + c_kQ_k + d_1g_1 + \cdots + d_mg_m$,
then its variance is simply $(p_1c_1 + \cdots + p_kc_k)\sigma^2$ as before.
Similar expressions hold good for the covariances.
The intrinsic properties considered in 3a.6 are also true. Equations
(3a.8.2) are always soluble. Also, the best estimates of parametric
functions are uncorrelated with the linear functions having zero
expectations and with the estimates of the parametric functions on the
right-hand side of equations (3a.8.1).


3a.9 Observational Equations with Correlated Variables

In the setup of (3a.1.1) it was assumed that $y_1, \ldots, y_n$ are independent
stochastic variables having a common variance $\sigma^2$. This condition
can be relaxed by assuming that the dispersion matrix is of the
form $\sigma^2\Lambda$ where the elements of $\Lambda$ are all known and $\sigma^2$ is an unknown
multiplier. The observational equations may be written

$$E(y) = \tau A'$$

where $A = (a_{ij})$. The condition of unbiasedness of an estimate $by'$ of
$p\tau'$ is the same as (3a.3.2)

$$p = bA \tag{3a.9.1}$$

The variance of $by'$ is proportional to $b\Lambda b'$. Minimizing this expression,
the best estimate is found to be

$$mA'\Lambda^{-1}y' \tag{3a.9.2}$$

where

$$p = mA'\Lambda^{-1}A$$

This shows that, if $p\tau'$ is estimable, the estimate is given by $pt'$ where
$t$ satisfies the equation

$$y\Lambda^{-1}A = tA'\Lambda^{-1}A \tag{3a.9.3}$$

which is similar to (3a.4.4). With the new definition $Q = y\Lambda^{-1}A$ the
results of 3a.5, 3a.6, 3a.7, and 3a.8 hold good in the correlated case also.

The equation (3a.9.3) can be obtained by minimizing the expression

$$\sum_i\sum_j \lambda^{ij}(y_i - a_{i1}\tau_1 - \cdots - a_{ik}\tau_k)(y_j - a_{j1}\tau_1 - \cdots - a_{jk}\tau_k)$$

where $(\lambda^{ij}) = \Lambda^{-1}$. If the $\tau$ parameters are subject to some restrictions,
then the above expression is minimized subject to these restrictions.


3b Tests of Linear Hypotheses 
3b.1 Nature of Linear Hypotheses 


The data on which tests of significance are based consist of $n$
independent observations $y_1, \ldots, y_n$ with a common variance $\sigma^2$ and having
expectations *

$$E(y_i) = a_{i1}\tau_1 + \cdots + a_{ik}\tau_k, \qquad i = 1, \ldots, n \tag{3b.1.1}$$

where $a_{ij}$ are known and $\tau_j$ are unknown parameters except that they
may be known to satisfy a set of $s$ restrictions, $R_0$.

$$R_0: \quad \begin{aligned} r_{11}\tau_1 + \cdots + r_{1k}\tau_k &= \gamma_1 \\ &\;\;\vdots \\ r_{s1}\tau_1 + \cdots + r_{sk}\tau_k &= \gamma_s \end{aligned} \tag{3b.1.2}$$

These linear restrictions can be assumed to be independent, for, if not,
they can be replaced by an independent set.

* If $E(y_i) = a_{i0} + a_{i1}\tau_1 + \cdots + a_{ik}\tau_k$ then $(y_i - a_{i0})$ can be considered to be the
stochastic variable.

A linear hypothesis $H_0$ specifies the values of one or more linear
functions of parameters.

$$H_0: \quad \begin{aligned} h_{11}\tau_1 + \cdots + h_{1k}\tau_k &= \theta_1 \\ &\;\;\vdots \\ h_{m1}\tau_1 + \cdots + h_{mk}\tau_k &= \theta_m \end{aligned} \tag{3b.1.3}$$

As in $R_0$, the linear functions in $H_0$ can be assumed to be independent.
Also, if some linear combination of the functions in $H_0$ can be expressed
in terms of the conditions in $R_0$, then they can be immediately verified.
To start with, $H_0$ may be replaced by a set of equations, no combination
of which belongs to $R_0$ (example 3 in 1b.1). Let $H_0$ in (3b.1.3) be such
a set.


3b.2 Test for Ho 


If the hypothesis $H_0$ is to be tested, it is necessary that all vectors
$h_1, \ldots, h_m$ in $H_0$ must belong to the vector space generated by the
vectors

$$a_1 = (a_{11}, \ldots, a_{1k}), \quad \ldots, \quad a_n = (a_{n1}, \ldots, a_{nk})$$

$$r_1 = (r_{11}, \ldots, r_{1k}), \quad \ldots, \quad r_s = (r_{s1}, \ldots, r_{sk})$$

This is the condition for estimability of the parametric functions in $H_0$.
Let

(i) $\sigma^2\chi^2_{R_0+H_0}$ be the minimum value of $\sum (y_i - a_{i1}\tau_1 - \cdots - a_{ik}\tau_k)^2$
when $\tau_i$ are subject to the conditions $R_0$ and $H_0$,

(ii) $\sigma^2\chi^2_{R_0}$ be the minimum value of $\sum (y_i - a_{i1}\tau_1 - \cdots - a_{ik}\tau_k)^2$
when $\tau_i$ are subject to the conditions $R_0$, and

(iii) $y_i$ be normally distributed.

It is shown in 2d.1 that $\chi^2_{R_0}$ is distributed as $\chi^2$ with $(n - r + t)$
degrees of freedom, where $r$ is the rank of the matrix $(a_{ij})$ and $t$
is the number of independent vectors in $R_0$ which
lie entirely in the space of $a_1, \ldots, a_n$. If $H_0$ is also considered,
there are $(t + m)$ independent vectors in $r_1, \ldots, r_s, h_1, \ldots, h_m$
which can be expressed in terms of $a_1, \ldots, a_n$, so that
$\chi^2_{R_0+H_0}$ is distributed as $\chi^2$ with $(n - r + t + m)$ degrees of freedom.
Hence, as shown in 2d.1,

$$\chi^2_{H_0|R_0} = \chi^2_{R_0+H_0} - \chi^2_{R_0}$$

is also a $\chi^2$ with $m$ degrees of freedom, the distribution being valid
only when the hypothesis $H_0$ is true. If $\sigma^2$ is known, then the $\chi^2$
distribution can be used to test $H_0$. On the other hand, the ratio

$$F = \frac{\chi^2_{H_0|R_0}}{m} \div \frac{\chi^2_{R_0}}{n - r + t}$$

is independent of $\sigma^2$ and is distributed as a variance ratio with $m$ and
$(n - r + t)$ degrees of freedom. Hence the hypothesis $H_0$ can be tested,
using the $F$ distribution when $\sigma^2$ is unknown.


3b.3 Test for $H_0$ When $R_0$ Is Not True

In problems of the nature posed in 3b.1 it is often desirable not to
take on trust the given restrictions but to test for them if possible. In
this case all parametric functions in $R_0$ must be estimable. If the
restrictions are true and are estimable, then $\chi^2_{R_0}$ is distributed as $\chi^2$
with $(n - r + s)$ degrees of freedom. On the other hand, the
unconditional minimum value $\chi_0^2$ of $(1/\sigma^2)\sum (y_i - a_{i1}\tau_1 - \cdots - a_{ik}\tau_k)^2$ is
distributed as $\chi^2$ with $(n - r)$ degrees of freedom. Hence a test for
$R_0$ is provided by the variance ratio

$$\frac{\chi^2_{R_0} - \chi_0^2}{s} \div \frac{\chi_0^2}{n - r}$$

based on $s$ and $(n - r)$ degrees of freedom. If this is significant, then
$R_0$ cannot be used in testing for $H_0$. In such a case all parametric
functions in $H_0$ must be directly estimable from the observational
equations (3b.1.1). The statistic $\chi^2_{H_0}$, the minimum value of
$(1/\sigma^2)\sum (y_i - a_{i1}\tau_1 - \cdots - a_{ik}\tau_k)^2$ subject to $H_0$, is distributed as $\chi^2$ with $(n - r + m)$
degrees of freedom. The test for $H_0$ when $R_0$ is not true is provided
by the variance ratio

$$\frac{\chi^2_{H_0} - \chi_0^2}{m} \div \frac{\chi_0^2}{n - r}$$

based on $m$ and $(n - r)$ degrees of freedom.

Example 1. Expressing the restrictions (3a.8.1) in the matrix form
$g = \tau R'$, show that the condition for the estimability of a parametric
function $p\tau'$ is that there exist vectors $b$ and $c$ such that

$$p = bA + cR$$

Example 2. The best estimate of $p\tau'$ is given by $\lambda Q' + cg'$ where
$Q = yA$ and $\lambda$, $c$ are such that

$$\lambda A'A + cR = p, \qquad \lambda R' = 0$$

Hence deduce the principle of least squares given in 3a.8.

Example 3. Show that the minimum value of

$$\sum (y_i - a_{i1}\tau_1 - a_{i2}\tau_2 - \cdots)^2$$

is $\sum y_i^2 - t_1Q_1 - t_2Q_2 - \cdots$, where $t_1, t_2, \ldots$ and $Q_1, Q_2, \ldots$ are as in
(3a.7.1).

Example 4. The minimum value of the expression in example 3 when
the $\tau$ parameters are subject to the relations (3a.8.1) is

$$\sum y_i^2 - t_1Q_1 - t_2Q_2 - \cdots - l_1g_1 - l_2g_2 - \cdots$$

where $t_i$, $Q_i$, $l_j$ are as in (3a.8.2).

Example 5. Let $z_1, z_2, \ldots, z_m$ be the estimates of parametric functions
in (3b.1.3) with the dispersion matrix $\sigma^2D$. If $c = (c_1, c_2, \ldots)$ is a
vector of arbitrary constants, then a linear compound of the $m$ deviations
from the hypothetical values in (3b.1.3) is

$$c(z - \theta)' = c_1(z_1 - \theta_1) + c_2(z_2 - \theta_2) + \cdots + c_m(z_m - \theta_m)$$

with its variance $\sigma^2cDc'$. Show that the maximum value of the ratio
$\{c(z - \theta)'\}^2/\sigma^2cDc'$ is $(z - \theta)D^{-1}(z - \theta)'/\sigma^2$.

Example 6. Show that $\chi^2_{H_0|R_0}$ defined in 3b.2 is $(z - \theta)D^{-1}(z - \theta)'/\sigma^2$,
the expression derived in example 5. (Hint: Follow the method of


3c The Combination of Weighted Observations 


3c.1 Transformation to Unweighted Observations

The general theory of least squares as discussed above can be extended
to observations having unequal variance but with known ratios. The
variances $\sigma_1^2, \ldots, \sigma_n^2$ of $y_1, \ldots, y_n$ can be expressed as $w_1^2\sigma^2, \ldots, w_n^2\sigma^2$
where $w_i$ are known quantities and $\sigma^2$ is an unknown parameter.
The problem is reduced to the unweighted case by replacing $y_i$ by $z_i$
where

$$z_i = \frac{y_i}{w_i}$$

in which case

$$E(z_i) = \frac{a_{i1}\tau_1 + \cdots + a_{ik}\tau_k}{w_i}, \qquad V(z_i) = \sigma^2 \tag{3c.1.1}$$

The general theory is applicable when the observational equations are
considered as in (3c.1.1).
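A minimal sketch of the transformation for a one-parameter case follows. It assumes, as an illustration not taken from the text, observational equations E(y_i) = tau * x_i with standard deviations proportional to known quantities w_i, so that dividing each equation by w_i gives observations of equal variance. The data and the function name `weighted_slope` are hypothetical.

```python
def weighted_slope(x, y, w):
    """Best estimate of tau in E(y_i) = tau * x_i when V(y_i) = w_i^2 * sigma^2.

    Each observational equation is divided by w_i, after which the
    ordinary (unweighted) least squares solution applies.
    """
    z = [yi / wi for yi, wi in zip(y, w)]   # transformed observations
    u = [xi / wi for xi, wi in zip(x, w)]   # transformed coefficients
    return sum(ui * zi for ui, zi in zip(u, z)) / sum(ui * ui for ui in u)


# Hypothetical data.
x = [2.0, 4.0, 5.0, 8.0]
y = [1.5, 2.9, 3.6, 6.1]

tau_equal = weighted_slope(x, y, [1.0] * 4)  # equal variances
tau_prop = weighted_slope(x, y, x)           # standard deviation proportional to x_i
print(tau_equal, tau_prop)
```

When the weights equal the x_i, the estimate reduces to the simple mean of the ratios y_i / x_i.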


3c.2 An Example of Weighted Observations 
Consider the observational equations

$$E(y_i) = \tau x_i, \qquad V(y_i) = x_i^2\sigma^2$$

where the $x_i$ are known but $\sigma^2$ is unknown. The transformed equations
are

$$E(z_i) = E\left(\frac{y_i}{x_i}\right) = \tau, \qquad V(z_i) = \sigma^2$$

The normal equation is

$$E(Q) = n\tau$$

where $Q = \sum z_i$. The best estimate of $\tau$ is

$$\frac{Q}{n} = \bar z, \qquad V(\bar z) = \frac{\sigma^2}{n}$$

For testing whether $\tau = \xi$, an assigned quantity, the following analysis
of sum of squares is needed.

$$\sum (z_i - \xi)^2 = n(\bar z - \xi)^2 + \sum (z_i - \bar z)^2$$

$$\chi_H^2 = (\chi_H^2 - \chi_0^2) + \chi_0^2$$

In the following data are considered the variables $y_i$, the dry weight
of paddy, and $x_i$, the green weight obtained from 25 samples. It is
desired to test whether the conversion factor from green to dry is 3/4.
The mean weight of dry paddy increases with the increase in green
weight and so also the variance. The constancy of coefficient of
variation is a plausible hypothesis, and therefore the method developed above
is applicable.

$$n = 25, \qquad \sum z_i = \sum (y_i/x_i) = 17.300, \qquad \bar z = 0.692$$

$$n\bar z^2 = 11.9716, \qquad \sum z_i^2 - n\bar z^2 = 0.2952$$

$$n(\bar z - \xi)^2 = (0.692 - 0.75)^2 \times 25 = 0.0841$$

TABLE 3c.2α. Test for a Given Value of $\tau$

                     D.F.      Sum of Squares (S.S.)                                          Mean Square (M.S.)   F Statistic
Due to hypothesis    1         $n(\bar z - \xi)^2 = 0.0841$                                   0.0841               6.837
Residual             $n - 1$   $\min \sum (z_i - \tau)^2 = \sum z_i^2 - n\bar z^2 = 0.2952$   0.0123

$$F = \frac{0.0841}{1} \div \frac{0.2952}{24} = 6.837$$

The ratio 6.837 with 1 and 24 degrees of freedom is significant at the
5% level, so we reject the hypothesis that the conversion factor is 3/4.
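The arithmetic of this test can be retraced from the summary figures quoted above. The sketch below is only a check of the computation; the variable names are illustrative, and the hypothetical value is taken as 0.75, the figure used in the sum of squares due to hypothesis.

```python
n = 25
z_bar = 0.692        # mean of z_i = y_i / x_i, as quoted
xi = 0.75            # hypothetical conversion factor
ss_resid = 0.2952    # sum of z_i^2 minus n * z_bar^2, as quoted

ss_hyp = n * (z_bar - xi) ** 2       # sum of squares due to hypothesis, 1 d.f.
ms_resid = ss_resid / (n - 1)        # residual mean square, 24 d.f.
F = ss_hyp / ms_resid                # variance ratio with 1 and 24 d.f.
print(round(ss_hyp, 4), round(ms_resid, 4), round(F, 3))
```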


3d Tests of Hypotheses with a Single Degree of Freedom 


3d.1 Student's t Test

Situations arise in which there is a single series of observations and
it is desired to test whether the mean in the population is an assigned
quantity $\mu_0$. If $n$ is the size of the sample and $\bar x$ the mean, the variance
ratio appropriate to test the above hypothesis is

$$F = \frac{n(\bar x - \mu_0)^2}{(\sum x_i^2 - n\bar x^2)/(n - 1)}$$

with 1 and $(n - 1)$ degrees of freedom. When the sum of squares in
the numerator has a single degree of freedom as above, tables * have
been constructed for the statistic

$$t = \frac{\sqrt{n}\,(\bar x - \mu_0)}{\sqrt{(\sum x_i^2 - n\bar x^2)/(n - 1)}}$$

This statistic is also useful in testing whether the true mean value is
above or below the hypothetical quantity $\mu_0$. This test was first
proposed by Student,† who demonstrated that an exact test of significance
is possible when the standard deviation is unknown. This epoch-making
discovery is the starting point of the exact sampling theory as developed
by R. A. Fisher.
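The relation between the two statistics can be illustrated numerically. The sketch below uses a hypothetical sample and hypothetical function names; it computes t and the variance ratio from the same quantities, so that the square of t can be compared with F while the sign of t records the direction of the deviation.

```python
import math


def t_statistic(x, mu0):
    """Student's t for the hypothesis that the population mean is mu0."""
    n = len(x)
    mean = sum(x) / n
    s2 = (sum(v * v for v in x) - n * mean * mean) / (n - 1)  # estimated variance
    return math.sqrt(n) * (mean - mu0) / math.sqrt(s2)


def variance_ratio(x, mu0):
    """The corresponding variance ratio with 1 and n - 1 degrees of freedom."""
    n = len(x)
    mean = sum(x) / n
    s2 = (sum(v * v for v in x) - n * mean * mean) / (n - 1)
    return n * (mean - mu0) ** 2 / s2


sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]   # hypothetical observations
t = t_statistic(sample, 4.0)
F = variance_ratio(sample, 4.0)
print(t, F)
```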


3d.2 Asymmetry of Right and Left Femora

The mean difference (right femur − left femur) in length between the
right and left femora of 36 skeletons of a certain series is found to be
2.0234; the corrected sum of squares of these 36 differences is 418.6875.
The estimated variance on 35 degrees of freedom is

$$\frac{418.6875}{35} = 11.9625$$

* See Table III in “Statistical Tables for Biological, Agricultural, and Medical
Research.”

† The pen name under which W. S. Gosset wrote. The reader is referred
to Student's (1908) original paper where this test was first derived.

If the right and left femora are of equal length on the average, then
the observed mean is a chance deviation from the true value, zero.
The value of

$$t = \frac{2.0234}{\sqrt{11.9625}}\sqrt{36} = 3.5193$$

is significant at the 5% level so that on the basis of this test the lengths
of the right and left femora cannot be considered equal on the average.
Actually the probability of a tabulated $t$ value corresponds to the
probability that the absolute value of $t$ (irrespective of sign) exceeds the
tabulated value. If it is known a priori that the alternative hypothesis
is that the right femur is longer than the left or if the purpose of the
test is to discriminate only the asymmetry due to the bigger length
of the right femur, then the sign of $t$ is important. The hypothesis of
equality is rejected in favor of the suggested alternative only when $t$
exceeds the upper 5% value of $t$ which corresponds to the 10% tabulated
value of $t$. If the alternative hypothesis is that the left femur has a
greater length, then $(-t)$ should exceed the 10% tabulated value for
significance. In the above example $t$ certainly exceeds the upper 5%
value of $t$, showing that the data are in agreement with the suggested
alternative that the right femur is longer than the left. If the suggested
alternative is the other way, the null hypothesis could not be rejected
on the basis of the test utilizing the lower 5% value of $t$. In any problem
the decision to use a two-sided or a one-sided test should be taken in
advance before the analysis is undertaken considering the situation
arising out of the problem at hand.
These tests are useful in situations where the mean values of two
series are to be compared, but the observations are such that there is
a one-to-one correspondence between a member of one series and a
member of the other. In the above example two measurements belong
to the same skeleton. The 36 pairs of measurements give rise to 36
differences which can now be treated as a single series in which the mean
is expected to be zero.

... a difference as the above may go undetected, even in a large sample.
The variance of femur length is about 400, in which case the variance
for difference in means of two independent series of 36 is

$$400\left(\tfrac{1}{36} + \tfrac{1}{36}\right) = 22.22$$

whereas the corresponding variance for the correlated series is
$11.9625 \div 36 = 0.33$, which admits a more precise assessment of asymmetry.
This aspect should be kept in view while conducting any investigation.
When two measurements are to be compared, it may be designed to
obtain two correlated series of measurements. The higher the
correlation, the greater is the advantage. The association should be positive;
otherwise the test based on correlated pairs becomes less efficient.
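The gain from pairing can be retraced from the summary figures of this section. The sketch below recomputes the quantities from the rounded inputs quoted in the text (mean difference 2.0234, corrected sum of squares 418.6875, variance of femur length about 400); the variable names are illustrative. Evaluated from these rounded inputs, t comes out near 3.51, close to the printed value, and the conclusion is unaffected.

```python
import math

n = 36
mean_diff = 2.0234      # mean of (right - left), as quoted
css = 418.6875          # corrected sum of squares of the 36 differences

s2 = css / (n - 1)      # estimated variance of a single difference
var_paired = s2 / n     # variance of the mean difference for the paired series
t = mean_diff / math.sqrt(var_paired)

# Two independent series of 36, each with variance about 400.
var_independent = 400 * (1 / 36 + 1 / 36)

print(round(s2, 4), round(var_paired, 2), round(var_independent, 2), round(t, 2))
```

The ratio of the two variances shows how much sharper the comparison becomes when the two measurements come from the same skeleton.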


3e Analysis of Variance 


3e.1 One-Way Classification 


Let there be $k$ samples of sizes $n_1, \ldots, n_k$ from $k$ populations with
unknown means $\mu_1, \ldots, \mu_k$ and with a common unknown variance $\sigma^2$.
The hypothesis which may be desired to be tested is

$$H: \quad \mu_1 = \mu_2 = \cdots = \mu_k$$

The observational equations, $n_1 + \cdots + n_k$ in number, are given below.

         First Sample                           kth Sample
   Observed                               Observed
   Variable       Expectation     ...     Variable       Expectation
   $x_{11}$       $\mu_1$         ...     $x_{k1}$       $\mu_k$
   ...            ...                     ...            ...
   $x_{1n_1}$     $\mu_1$         ...     $x_{kn_k}$     $\mu_k$
   Total $T_1$                    ...     Total $T_k$

The minimum value of $\sum\sum (x_{ij} - \mu)^2$ subject to the condition of the
hypothesis is

$$\sum\sum x_{ij}^2 - \frac{T^2}{n}, \qquad n = n_1 + \cdots + n_k \tag{3e.1.1}$$

which is the total corrected sum of squares of all the observations. The
minimum value of $\sum\sum (x_{ij} - \mu_i)^2$ without any restrictions is

$$\sum\sum x_{ij}^2 - \sum_i \frac{T_i^2}{n_i} \tag{3e.1.2}$$

The sum of squares due to deviation from the hypothesis

$$(3e.1.1) - (3e.1.2) = \frac{T_1^2}{n_1} + \cdots + \frac{T_k^2}{n_k} - \frac{T^2}{n}, \qquad T = T_1 + \cdots + T_k \tag{3e.1.3}$$

is obtained from the totals for the samples only. This has $(k - 1)$
degrees of freedom. It is easier to calculate the expressions (3e.1.1)
and (3e.1.3) in practice and derive the expression (3e.1.2) by
subtraction. The scheme of computation is set out below.

TABLE 3e.1α. Analysis of Variance, One-Way Classification

                                                 D.F.       S.S.
Deviation from hypothesis or between samples     $k - 1$    $\sum T_i^2/n_i - T^2/n$
Residual                                         *          *
Total                                            $n - 1$    $\sum\sum x_{ij}^2 - T^2/n$

The quantities marked by * are obtained by subtraction. The $F$
statistic is constructed using the mean squares derived from the above
table.

The following data relate to the head breadths of 142 skulls belonging
to three series. Can the mean head breadth be considered the same in
the three series?

                          Head Breadth
Series    Sample Size     Total      Mean
1         83              11,277     135.87
2         51              7,049      138.22
3         8               1,102      137.75
Total     142             19,428     136.817

The sum of squares between series is

$$11{,}277 \times 135.87 + 7{,}049 \times 138.22 + 1{,}102 \times 137.75 - 19{,}428 \times 136.817 = 238.59$$

The total sum of squares is found to be 4616.64. The analysis of
variance is set out in Table 3e.1β.

TABLE 3e.1β. Analysis of Variance

                  D.F.    S.S.       M.S.      F
Between series    2       238.59     119.29    3.79
Residual          139     4378.05    31.50
Total             141     4616.64

The variance ratio 3.79 with 2 and 139 degrees of freedom is significant
at the 5% level so that the mean values cannot be considered equal in
all the series.
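The entries of the analysis of variance table can be retraced from the totals, sizes, and means of the three series. The sketch below reproduces the product form used in the text (total times rounded mean). Because totals of order 10,000 are multiplied by means rounded to two decimals, that form is sensitive to rounding, so the sketch also evaluates the algebraically equivalent form (3e.1.3) from totals and sizes alone for comparison; the two need not agree closely. Variable names are illustrative.

```python
sizes = [83, 51, 8]
totals = [11277, 7049, 1102]
means = [135.87, 138.22, 137.75]        # rounded means as printed
grand_total, grand_mean, n = 19428, 136.817, 142
total_ss = 4616.64                      # total sum of squares, as quoted

# Between-series sum of squares in the product form used in the text.
between = sum(t * m for t, m in zip(totals, means)) - grand_total * grand_mean

# Form (3e.1.3): sum of T_i^2 / n_i minus T^2 / n, from totals and sizes only.
between_from_totals = (sum(t * t / k for t, k in zip(totals, sizes))
                       - grand_total ** 2 / n)

df_between, df_resid = len(sizes) - 1, n - len(sizes)
residual = total_ss - between           # obtained by subtraction
F = (between / df_between) / (residual / df_resid)

print(round(between, 2), round(residual, 2), round(F, 2))
print(round(between_from_totals, 2))
```

Carrying more decimals in the means, or working from the totals as in (3e.1.3), is a simple guard against this kind of rounding effect.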


When between series has 1 degree of freedom the square root of $F$
can be referred to the $t$ distribution with degrees of freedom of the
residual. If the two series can be distinguished as the first and the
second, then $t$ can be given the same sign as the difference between the
averages of the first and second series. Then as in 3d.1 it is possible
to test whether the mean of the first series significantly exceeds the other,
and vice versa.


3e.2 Two-Way Classification with a Single Observation
in a Cell

Let there be $pq$ observations, each of which can be specified in terms
of the categories of two classes. The observations may be set out in
the following tabular form.

                          Class B
Class A     $B_1$       $B_2$       ...    $B_p$       Total
$A_1$       $x_{11}$    $x_{12}$    ...    $x_{1p}$    $x_{1.}$
$A_2$       $x_{21}$    $x_{22}$    ...    $x_{2p}$    $x_{2.}$
...         ...         ...                ...         ...
$A_q$       $x_{q1}$    $x_{q2}$    ...    $x_{qp}$    $x_{q.}$
Total       $x_{.1}$    $x_{.2}$    ...    $x_{.p}$    $x_{..}$

The observational equations are known to be

$$E(x_{ij}) = \alpha_i + \beta_j, \qquad V(x_{ij}) = \sigma^2 \tag{3e.2.1}$$

The $\alpha$ and $\beta$ parameters may be called the effects of the categories in
the A and B classes. When only a single observation is present in each
cell, it is not possible to test whether the additive setup assumed in
(3e.2.1) is correct. If, however, this can be taken to be true, two
hypotheses which may be tested from these data are

$$\alpha_1 = \alpha_2 = \cdots = \alpha_q$$

$$\beta_1 = \beta_2 = \cdots = \beta_p \tag{3e.2.2}$$

It is easy to see that the rank of the matrix of equations (3e.2.1) is
$(p + q - 1)$ so that not all parametric functions can be estimated. As
a matter of fact, the individual parameters $\alpha_i$ and $\beta_j$ are not estimable.
But the differences of $\alpha_i$ or $\beta_j$ which are relevant to the hypotheses
(3e.2.2) to be tested are estimable. The residual sum of squares is the
minimum value of $\sum\sum (x_{ij} - \alpha_i - \beta_j)^2$ with $pq - (p + q - 1)$ degrees
of freedom. This is obtained as

$$\sum\sum (x_{ij} - \bar x_{i.} - \bar x_{.j} + \bar x_{..})^2 \tag{3e.2.3}$$

The minimum value of the sum of squares $\sum\sum (x_{ij} - \alpha_i - \beta_j)^2$ subject
to the restriction $\alpha_1 = \cdots = \alpha_q$ is

$$\sum\sum (x_{ij} - \bar x_{i.} - \bar x_{.j} + \bar x_{..})^2 + \frac{1}{p}\sum x_{i.}^2 - \frac{1}{pq}x_{..}^2 \tag{3e.2.4}$$

with $(pq - p)$ degrees of freedom. The sum of squares between the
categories of class A, obtained by subtraction, is

$$\frac{1}{p}\sum x_{i.}^2 - \frac{1}{pq}x_{..}^2 \tag{3e.2.5}$$

with $(q - 1)$ degrees of freedom. The sum of squares due to the
categories of class B is similarly

$$\frac{1}{q}\sum x_{.j}^2 - \frac{1}{pq}x_{..}^2 \tag{3e.2.6}$$

with $(p - 1)$ degrees of freedom. The expressions (3e.2.5) and (3e.2.6)
are easy to compute, and the expression (3e.2.3) can be deduced from
the equality

$$\sum\sum x_{ij}^2 - \frac{x_{..}^2}{pq} = \text{the sum of the expressions (3e.2.3), (3e.2.5), and (3e.2.6)}$$

The scheme of computation is presented below.

TABLE 3e.2α. Analysis of Variance, Two-Way Classification

                       D.F.                S.S.
Between A classes      $q - 1$             $\frac{1}{p}\sum x_{i.}^2 - \frac{1}{pq}x_{..}^2$
Between B classes      $p - 1$             $\frac{1}{q}\sum x_{.j}^2 - \frac{1}{pq}x_{..}^2$
Residual               $(p - 1)(q - 1)$    *
Total                  $pq - 1$            $\sum\sum x_{ij}^2 - \frac{1}{pq}x_{..}^2$

* Obtained by subtraction.

The analysis is similar for more complex classifications. The numerical
methods of the analysis are given in 3g.3.
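The scheme of computation can be sketched for a small table. The data below are hypothetical (four A classes by three B classes, one observation per cell). The residual is obtained by subtraction as in the table, and is also evaluated directly from expression (3e.2.3) as a check.

```python
# Hypothetical q x p table: rows are A classes, columns are B classes.
x = [[12.0, 15.0, 11.0],
     [14.0, 18.0, 13.0],
     [10.0, 14.0, 10.0],
     [13.0, 16.0, 12.0]]
q, p = len(x), len(x[0])

row_tot = [sum(r) for r in x]
col_tot = [sum(x[i][j] for i in range(q)) for j in range(p)]
grand = sum(row_tot)
cf = grand ** 2 / (p * q)                       # correction term x..^2 / pq

ss_total = sum(v * v for r in x for v in r) - cf
ss_a = sum(t * t for t in row_tot) / p - cf     # (3e.2.5), q - 1 d.f.
ss_b = sum(t * t for t in col_tot) / q - cf     # (3e.2.6), p - 1 d.f.
ss_resid = ss_total - ss_a - ss_b               # residual, by subtraction

# Direct evaluation of (3e.2.3) as a check on the subtraction.
gm = grand / (p * q)
direct = sum((x[i][j] - row_tot[i] / p - col_tot[j] / q + gm) ** 2
             for i in range(q) for j in range(p))

ms_resid = ss_resid / ((p - 1) * (q - 1))
f_a = (ss_a / (q - 1)) / ms_resid               # variance ratio for A classes
f_b = (ss_b / (p - 1)) / ms_resid               # variance ratio for B classes
print(ss_a, ss_b, ss_resid, f_a, f_b)
```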
3e.3 Two-Way Classification with Multiple but Equal Numbers
in Cells

In the last section it was shown that the differences in class effects
can be tested when the effects due to the classes are known to be
additive. This can, however, be tested when each cell contains more than
one observation. If there are $n$ observations in the $(i, j)$th cell, they
may be represented by

$$x_{ij1}, x_{ij2}, \ldots, x_{ijn}$$

with a total $x_{ij}$ and mean $\bar x_{ij}$. The observational equations are

$$E(x_{ijk}) = \alpha_{ij}, \qquad V(x_{ijk}) = \sigma^2$$

The hypothesis to be tested is

$$\alpha_{ij} = \alpha_i + \beta_j$$

If this is not true, there is said to be interaction, in which case the test
of significance for the class effects cannot be properly interpreted.
Differences in A classes might be tested for each B class. The magnitudes
of these will depend on the nature of the B class considered. The
residual sum of squares with $pq(n - 1)$ degrees of freedom is

$$\min \sum\sum\sum (x_{ijk} - \alpha_{ij})^2 = \sum\sum\sum (x_{ijk} - \bar x_{ij})^2$$

treating $\alpha_{ij}$ as free parameters. The sums of squares due to the
interaction and the A or B classes are derivable as in 3e.2 by considering
the totals $x_{ij}$ in each cell as single observations and dividing the final
expressions of sums of squares by $n$ to reduce to the scale of the original
observations. By considering the obvious identity relations, the scheme
of computation is given in Table 3e.3α.

TABLE 3e.3α. Analysis of Variance, Two-Way Classification

                       D.F.                S.S.
Between A classes      $q - 1$             $\frac{1}{np}\sum x_{i.}^2 - \frac{1}{npq}x_{..}^2$
Between B classes      $p - 1$             $\frac{1}{nq}\sum x_{.j}^2 - \frac{1}{npq}x_{..}^2$
Interaction            $(p - 1)(q - 1)$    *
Between pq cells       $pq - 1$            $\frac{1}{n}\sum\sum x_{ij}^2 - \frac{1}{npq}x_{..}^2$
Residual               $pq(n - 1)$         *
Total                  $npq - 1$           $\sum\sum\sum x_{ijk}^2 - \frac{1}{npq}x_{..}^2$

The sums of squares indicated by * are to be filled in by subtraction.
If the interaction is significant when tested against the residual, it may
be necessary to test for differences in A classes for every B class, or
vice versa. The sum of squares with $(q - 1)$ degrees of freedom due
to A classes for the $j$th class of B is

$$\frac{1}{n}\sum_i x_{ij}^2 - \frac{1}{nq}x_{.j}^2$$

The mean square corresponding to this is tested against the residual
mean square.

If the interaction is not significant, the first two entries of Table
3e.3α can be tested against the residual mean square or the interaction
mean square, whichever is greater. This is to guard against any bias
due to small effects of interaction which could not be detected by the
test but which is indicated when the interaction mean square exceeds
the error. When both the mean squares are of the same magnitude,
a common estimate can be obtained by adding the degrees of freedom
and the sum of squares corresponding to interaction and error.


8e.4 Two-Way Classification with Unequal Numbers in Cells 
The following notation is used. 


tij = the total of all observations in the (i, 7)th cell. 
Tij = mean in the (7, j)th cell. 
T.j = )) ay; (total for the jth column). 

i 


zi. = È. zi (total for the ith row). 
J 


x.. = total of all observations. 


Ē.. = mean of all observations. 


Nij, Nj, Ni., and n.. are the numbers of observations for the (i, j)th 
cell, jth column, ith row, and all t 


he cells, respectively. 

The data in Table 3e.4α present the mean values and totals of nasal
height of skulls excavated from three different strata by three observers.
It is desired to test for the stratum and observer differences.

In problems of this nature it is convenient to set up the figures as in
Table 3e.4α for the computation of the various sums of squares. The
analysis is carried out in three stages. The total sum of squares with
309 degrees of freedom is found to be 5398.4206. For further calculations
the entries in this table are sufficient.

A. The Computation of Between-Cell Sum of Squares. The between-cell
sum of squares with $pq - 1 = 8$ degrees of freedom is

$$\sum\sum \bar{x}_{ij}\,x_{ij} - \bar{x}_{\cdot\cdot}\,x_{\cdot\cdot} = 931.5204.$$


TABLE 3e.4α. Mean Values and Totals

                                   Strata
Observer                 S1          S2          S3        Row margin
O1    total x_1j      1,071.00    1,572.48      913.50      3,556.98
      mean               51.00       49.14       50.75        50.098
      number n_1j         (21)        (32)        (18)          (71)
O2    total x_2j      1,966.86    2,315.40    1,721.52      6,003.78
      mean               46.83       45.40       47.82        46.541
      number n_2j         (42)        (51)        (36)         (129)
O3    total x_3j      1,219.00    2,091.60    1,849.20      5,159.80
      mean               48.76       46.48       46.23        46.907
      number n_3j         (25)        (45)        (40)         (110)
Column total x_.j     4,256.86    5,979.48    4,484.22     14,720.56  = x..
Column mean              48.373      46.715      47.704      47.4857  = mean of all
Column number n_.j        (88)       (128)        (94)         (310)  = n..


B. The Computation of the Interaction Sum of Squares. When there
are only two classes for A or B, the computation of the interaction sum
of squares is very simple.

TABLE 3e.4β. Mean Values in Cells and Weights

                              B1           B2        ...       Bp         Total
A1                           x̄_11         x̄_12       ...      x̄_1p
A2                           x̄_21         x̄_22       ...      x̄_2p
Difference                    y_1          y_2        ...       y_p
Weight                        w_1          w_2        ...       w_p        Σw_i
Difference × Weight         w_1 y_1      w_2 y_2      ...     w_p y_p      Σw_i y_i
(Difference)² × Weight      w_1 y_1²     w_2 y_2²     ...     w_p y_p²     Σw_i y_i²

The weight for the $j$th column is $w_j = n_{1j}\,n_{2j}/(n_{1j} + n_{2j})$.

Note: For obtaining $wy^2$, we need to multiply $wy$ by $y$.


The interaction sum of squares with $(2 - 1)(p - 1)$ degrees of freedom is

$$\sum w_i y_i^2 - \frac{\left(\sum w_i y_i\right)^2}{\sum w_i}.$$
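The two-class computation laid out in Table 3e.4β can be sketched in a few lines. This is only an illustration: the function name `two_class_interaction_ss` is hypothetical, and treating observers O1 and O2 of Table 3e.4α as a 2 × 3 table is an assumption made purely to have numbers to feed in.

```python
def two_class_interaction_ss(means_a1, means_a2, n_a1, n_a2):
    """Interaction sum of squares for a 2 x p table with unequal cell numbers."""
    # Column-wise differences of the two rows of cell means.
    y = [m1 - m2 for m1, m2 in zip(means_a1, means_a2)]
    # Weights w_j = n_1j * n_2j / (n_1j + n_2j).
    w = [n1 * n2 / (n1 + n2) for n1, n2 in zip(n_a1, n_a2)]
    sum_wy = sum(wi * yi for wi, yi in zip(w, y))
    sum_wy2 = sum(wi * yi * yi for wi, yi in zip(w, y))
    return sum_wy2 - sum_wy ** 2 / sum(w)

# Illustration only: rows O1 and O2 of Table 3e.4α taken as a 2 x 3 table.
ss_o1_o2 = two_class_interaction_ss(
    [51.00, 49.14, 50.75], [46.83, 45.40, 47.82],
    [21, 32, 18], [42, 51, 36],
)
```

When the two rows of means are parallel, all differences are equal and the function returns zero, which is what absence of interaction should give.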


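In the general case, for any number of A and B classes the absence
of interaction means that

$$\alpha_{ij} + \alpha_{i'j'} - \alpha_{ij'} - \alpha_{i'j} = 0$$

for all $i \neq i'$, $j \neq j'$, where $\alpha_{ij}$ is the expected value in the $(i, j)$th cell.
The best estimate of this tetrad difference is

$$\bar{x}_{ij} + \bar{x}_{i'j'} - \bar{x}_{ij'} - \bar{x}_{i'j}$$

To obtain the sum of squares due to interaction we can directly consider
these functions and derive the suitable sum of squares.
There are $(p - 1)(q - 1)$ such independent functions whose variances
and covariances can be easily written down. Consider, for instance,
a 3 × 3 table with the mean values

$$\begin{matrix} \bar{x}_{11} & \bar{x}_{12} & \bar{x}_{13} \\ \bar{x}_{21} & \bar{x}_{22} & \bar{x}_{23} \\ \bar{x}_{31} & \bar{x}_{32} & \bar{x}_{33} \end{matrix}$$

The following four functions indicated by the tetrad differences may
be taken:

$$\begin{aligned}
y_1 &= +\bar{x}_{11} + \bar{x}_{22} - \bar{x}_{21} - \bar{x}_{12} = 0.43 \\
y_2 &= -\bar{x}_{22} + \bar{x}_{12} + \bar{x}_{23} - \bar{x}_{13} = 0.81 \\
y_3 &= -\bar{x}_{22} + \bar{x}_{21} + \bar{x}_{32} - \bar{x}_{31} = -0.85 \\
y_4 &= +\bar{x}_{22} - \bar{x}_{23} - \bar{x}_{32} + \bar{x}_{33} = -2.67
\end{aligned}$$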

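If $\sigma^2(a_{ij})$ is the covariance matrix of $y_1, y_2, y_3, y_4$, then

$$\begin{aligned}
a_{11} &= \frac{1}{n_{11}} + \frac{1}{n_{22}} + \frac{1}{n_{21}} + \frac{1}{n_{12}} = 0.122286 \\
a_{12} &= -\frac{1}{n_{22}} - \frac{1}{n_{12}} = -0.050858 \\
a_{13} &= -\frac{1}{n_{22}} - \frac{1}{n_{21}} = -0.043417 \\
a_{14} &= \frac{1}{n_{22}} = 0.019608 \\
a_{22} &= \frac{1}{n_{22}} + \frac{1}{n_{12}} + \frac{1}{n_{23}} + \frac{1}{n_{13}} = 0.134192 \\
a_{23} &= \frac{1}{n_{22}} = 0.019608 \\
a_{24} &= -\frac{1}{n_{22}} - \frac{1}{n_{23}} = -0.047386 \\
a_{33} &= \frac{1}{n_{22}} + \frac{1}{n_{21}} + \frac{1}{n_{32}} + \frac{1}{n_{31}} = 0.105639 \\
a_{34} &= -\frac{1}{n_{22}} - \frac{1}{n_{32}} = -0.041830 \\
a_{44} &= \frac{1}{n_{22}} + \frac{1}{n_{23}} + \frac{1}{n_{32}} + \frac{1}{n_{33}} = 0.094608
\end{aligned}$$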

Table 3e.4γ contains the required computations. The matrix $(a_{ij})$ is
written with an extended column of $y$ and reduced by the method of
pivotal condensation (1d.1).

The elements below the diagonal are omitted because the matrix is 
symmetrical at each stage. It is sometimes necessary to retain a large 


number of decimal places to obtain sufficient accuracy in the final value. 
More examples illustrating this computational scheme are given in 
Chapter 7. 
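As a cross-check on the condensation, the same quantity can be obtained as the quadratic form $y'A^{-1}y$, where $A = (a_{ij})$. The sketch below rebuilds $y$ and $A$ from the cell means and numbers of Table 3e.4α and solves the equations by ordinary Gaussian elimination rather than by the tabular scheme of the text; the helper `solve` and the dictionary encoding of the tetrads are assumptions of this illustration.

```python
# Cell means and cell numbers from Table 3e.4α (rows: observers, columns: strata).
means = [[51.00, 49.14, 50.75],
         [46.83, 45.40, 47.82],
         [48.76, 46.48, 46.23]]
counts = [[21, 32, 18],
          [42, 51, 36],
          [25, 45, 40]]

# Each tetrad difference as {(row, column): coefficient}, zero-based indices.
tetrads = [
    {(0, 0): 1, (1, 1): 1, (1, 0): -1, (0, 1): -1},   # y1
    {(1, 1): -1, (0, 1): 1, (1, 2): 1, (0, 2): -1},   # y2
    {(1, 1): -1, (1, 0): 1, (2, 1): 1, (2, 0): -1},   # y3
    {(1, 1): 1, (1, 2): -1, (2, 1): -1, (2, 2): 1},   # y4
]

y = [sum(c * means[i][j] for (i, j), c in t.items()) for t in tetrads]

# a_rs = sum over cells shared by tetrads r and s of (c_r * c_s / n_ij).
a = [[sum(c * s.get(cell, 0) / counts[cell[0]][cell[1]] for cell, c in t.items())
      for s in tetrads] for t in tetrads]

def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= factor * m[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        tail = sum(m[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (m[k][n] - tail) / m[k][k]
    return z

# Interaction sum of squares as the quadratic form y' A^{-1} y.
interaction_ss = sum(yi * zi for yi, zi in zip(y, solve(a, y)))
```

Carried out in floating point, this should land close to the hand-computed value quoted in the text; small differences in the later decimals are to be expected from the six-place rounding of the tabular work.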


The last pivotal value with the sign changed, 125.5049, is the sum
of squares for interaction. An alternative way of calculating the
interaction sum of squares is by fitting constants by the method of least
squares. We need to find the minimum value of

$$\sum\sum n_{ij}(\bar{x}_{ij} - \alpha_i - \beta_j)^2$$

If $a_i$ and $b_j$ are the optimum values, then the minimum value is

$$\sum\sum n_{ij}\bar{x}_{ij}^2 - \sum a_i x_{i\cdot} - \sum b_j x_{\cdot j}$$


TABLE 3e.4γ. Method of Pivotal Condensation for Interaction
Sum of Squares

              Matrix a_ij                             Value of y
 0.122286   −0.050858   −0.043417    0.019608         0.43
             0.134192    0.019608   −0.047386         0.81
                         0.105639   −0.041830        −0.85
                                     0.094608        −2.67
                                                      0
 1          −0.415894   −0.355045    0.160345         3.516347
             0.113040    0.001551   −0.039231         0.988834
                         0.090224   −0.034868        −0.697331
                                     0.091464        −2.738949
                                                     −1.512029
             1           0.013721   −0.347054         8.747647
                         0.090203   −0.034330        −0.710899
                                     0.077849        −2.395770
                                                    −10.162000
                         1          −0.380586        −7.881102
                                     0.064783        −2.666328
                                                    −15.764667
                                     1              −41.15783
                                                   −125.504944


The optimum values are obtained from the equations

  a1     a2     a3     b1     b2     b3        Marginal Total
  71      .      .     21     32     18    =   3556.98
   .    129      .     42     51     36    =   6003.78
   .      .    110     25     45     40    =   5159.80
  21     42     25     88      .      .    =   4256.86
  32     51     45      .    128      .    =   5979.48
  18     36     40      .      .     94    =   4484.22


The method of writing down these equations is very simple. Start with
any marginal total, say 3556.98, based on 71 observations distributed
in the B classes as 21, 32, and 18. This gives the first equation. There
are six marginal totals corresponding to A and B classes, giving rise to
six equations.

The method of solution is also simple. First reduce the first three
equations by making the coefficients of $a_1$, $a_2$, $a_3$ equal to unity. The
value of $b_3$ can be assumed to be zero so that the column corresponding
to $b_3$ (the last constant) may be omitted.


  a1     a2     a3       b1          b2         Marginal Value
   1      .      .    0.295775    0.450704      50.098310
   .      1      .    0.325581    0.395349      46.540930
   .      .      1    0.227273    0.409091      46.907272


Subtracting from the fourth row of the original equations 21 (first row
above) + 42 (second row above) + 25 (third row above), and similarly
from the fifth row subtracting 32 (first row) + 51 (second row) + 45
(third row), we obtain

      b1             b2
  62.432498     −36.296717     =     77.394630
 −36.296717      75.005578     =   −108.080590

which give simultaneous equations in $b_1$, $b_2$ with solutions

$$b_1 = 0.559250 \qquad b_2 = -1.170335 \qquad b_3 = 0$$

Substituting them in the first three reduced equations with unit
coefficients for $a_1$, $a_2$, $a_3$, we find the values of $a_1$, $a_2$, $a_3$ to be

$$a_1 = 50.460372 \qquad a_2 = 46.821539 \qquad a_3 = 47.258943$$


The minimum sum of squares is then computed:

$$\begin{aligned}
&\text{Uncorrected between-cell sum of squares} \\
&\quad - a_1(3556.98) - a_2(6003.78) - \cdots \\
&\quad - b_1(4256.86) - b_2(5979.48) - \cdots \\
&\quad = 125.4490
\end{aligned}$$
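The fitting of constants can also be sketched directly from the cell totals and cell numbers. The code below sets up the marginal equations with $b_3$ fixed at zero and solves all five at once by Gaussian elimination, instead of first eliminating the $a$'s as in the text; the helper `solve` and all variable names are assumptions of this illustration.

```python
# Cell totals and cell numbers from Table 3e.4α (rows: A classes, columns: B classes).
totals = [[1071.00, 1572.48, 913.50],
          [1966.86, 2315.40, 1721.52],
          [1219.00, 2091.60, 1849.20]]
counts = [[21, 32, 18], [42, 51, 36], [25, 45, 40]]
rows, cols = 3, 3

def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= factor * m[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        tail = sum(m[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (m[k][n] - tail) / m[k][k]
    return z

row_n = [sum(counts[i]) for i in range(rows)]
col_n = [sum(counts[i][j] for i in range(rows)) for j in range(cols)]
row_t = [sum(totals[i]) for i in range(rows)]
col_t = [sum(totals[i][j] for i in range(rows)) for j in range(cols)]

# Unknowns a1, a2, a3, b1, b2; the last b is set to zero, so its column
# and its equation are dropped.
free_b = cols - 1
size = rows + free_b
m = [[0.0] * size for _ in range(size)]
rhs = row_t + col_t[:free_b]
for i in range(rows):
    m[i][i] = row_n[i]
    for j in range(free_b):
        m[i][rows + j] = counts[i][j]
        m[rows + j][i] = counts[i][j]
for j in range(free_b):
    m[rows + j][rows + j] = col_n[j]

sol = solve(m, rhs)
a = sol[:rows]
b = sol[rows:] + [0.0]

uncorrected = sum(totals[i][j] ** 2 / counts[i][j]
                  for i in range(rows) for j in range(cols))
interaction_ss = (uncorrected
                  - sum(ai * ti for ai, ti in zip(a, row_t))
                  - sum(bj * tj for bj, tj in zip(b, col_t)))
```

Because the result is a small difference between quantities near 700,000, the later decimals are sensitive to rounding, which is the point the text makes about the two hand computations agreeing only to about three significant figures.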


The method of solving the above equations is quite general. First, 
omit one b if the number of b classes is greater than or equal to that of 
a, and reduce the coefficients of a to unity in the first set and eliminate 
a from the latter set. The resulting equations contain b only, which 
may be directly solved. Also, the last equation may be omitted or it 
may be retained as a check, since the column sums should be zero. 
The a coefficients can be obtained by substitution for b in the first set
of reduced equations. The value 125.4490 for interaction sum of squares
agrees with that obtained earlier only to three significant figures. This
is the order of accuracy expected in computations of this nature unless,
of course, the calculations are carried to a large number of decimal
places. This is unnecessary because the data, to start with, may not
be so accurate, in which case there is a limit to the number of significant
figures in any computed value. Even in the above case the number
of decimal places retained could be reduced, but it is a good arithmetical
discipline to assume that the original figures are absolutely correct and
carry out the computations to as many decimal places as can be
conveniently retained. In this particular case the average of the two
values is 125.4769, which is used in Table 3e.4ε. In general, the first
method needs the reduction of a matrix of $[(p - 1)(q - 1) + 1]$th
order; the second leads to a solution of simultaneous equations of order
$(p + q - 1)$. The latter method may be relatively simpler when p and
q exceed 3.

C. The Computation of Main Effects. If the interaction is significant,
then the problem reduces to testing the differences in each row or in
each column of the two-way table. On the other hand, when the
interaction is not significant, the main effects may be tested by considering
all the table entries.

As shown above, the interaction sum of squares is the minimum value
of $\sum\sum n_{ij}(\bar{x}_{ij} - \alpha_i - \beta_j)^2$ when minimized with respect to $\alpha$ and $\beta$. If
now this quantity is minimized with the further restriction that
$\alpha_1 = \cdots = \alpha_q$, it is easily seen that the minimum value is

$$\left\{\sum\sum \frac{x_{ij}^2}{n_{ij}} - \frac{x_{\cdot\cdot}^2}{n_{\cdot\cdot}}\right\} - \left\{\sum \frac{x_{\cdot j}^2}{n_{\cdot j}} - \frac{x_{\cdot\cdot}^2}{n_{\cdot\cdot}}\right\}$$

This can be recognized as the total sum of squares between the pq cells
minus the sum of squares between the B classes, ignoring the
classification due to A. If the interaction sum of squares is subtracted from
this, the valid sum of squares for testing the differences in A classes is
obtained. The scheme of computation is shown in Table 3e.4δ.


TABLE 3e.4δ. Sum of Squares for A Classes

                                  D.F.               S.S.
Between B classes ignoring A      p − 1              $\sum x_{\cdot j}^2/n_{\cdot j} - x_{\cdot\cdot}^2/n_{\cdot\cdot}$
Interaction                       (p − 1)(q − 1)     (As obtained in stage B)
Between A classes                 q − 1              *
Total between cells               pq − 1             $\sum\sum x_{ij}^2/n_{ij} - x_{\cdot\cdot}^2/n_{\cdot\cdot}$

* Obtained by subtraction.




Similarly, the valid sum of squares between B classes is obtained. The 
mean squares obtained from each of these sums of squares can be tested 
against the residual or the interaction mean square, whichever is 
greater. This completes the analysis. 

The final analysis of variance is given in Table 3e.4ε.


TABLE 3e.4ε. Complete Analysis of Variance for the Two-Way Data

                     D.F.      S.S.         M.S.    ||     M.S.        S.S.       D.F.
Strata, ignoring
  observers            2      147.6319     73.8159  ||   317.0758    634.1516      2    Observers, ignoring strata
Interaction            4      125.4769     31.3692  ||               125.4769      4    Interaction
Observers              2      658.4116 *  329.2058  ||    85.9459    171.8919 *    2    Strata
Between cells          8      931.5204              ||               931.5204      8
Within cells         301     4466.9002 *   14.8402
Total                309     5398.4206

* Obtained by subtraction.


The variance ratio for interaction is

$$\frac{31.3692}{14.8402} = 2.11$$

which with 4 and 301 degrees of freedom is not significant. The
interaction mean square is greater than the within-cell mean square and is
therefore used in testing for observer and stratum differences. The
variance ratio for observers is

$$\frac{329.2058}{31.3692} = 10.49$$

which with 2 and 4 degrees of freedom is significant at the 5% level.
The variance ratio for strata is

$$\frac{85.9459}{31.3692} = 2.74$$

which with 2 and 4 degrees of freedom is not significant.




Thus the discrepancy in the mean values can be traced to observer 
differences. Probably the observers used different techniques of measur- 
ing nasal height. 

The calculations become very simple when the unequal cell numbers
are proportionate so that

$$n_{ij} = \frac{n_{i\cdot}\,n_{\cdot j}}{n_{\cdot\cdot}}$$

In this case the main effects of A and B can be calculated in the usual
manner from the marginal totals. Thus the sum of squares due to A is

$$\sum \frac{x_{i\cdot}^2}{n_{i\cdot}} - \frac{x_{\cdot\cdot}^2}{n_{\cdot\cdot}}$$

The interaction sum of squares is obtained by subtracting the sums of
squares due to A and B from the total for the pq cells.

Suppose that we ignore the observers and look for stratum differences. 
The analysis will then be in the nature of 


D.F. M.S. Ratio 
Between strata 2 73.8159 4.31 
Within strata 307 17.1035 


which gives a significant ratio of 4.31, leading to the conclusion that 
three different groups of people inhabited the three different strata. 
A closer analysis reveals that observer differences are also important 


so that some caution is necessary in combining the results of the three 
different investigators. 


3f The Theory of Statistical Regression 
3f.1 The Concept of Regression 


The theory of regression is concerned with the derivation of the
relationship between a set of variables $x_1, \cdots, x_p$ and the mean value
of another variable $y$ observable with them. The variables $x_1, \cdots, x_p$
are called concomitant variables, and $y$ the dependent variable. The
equation

$$m = R(x_1, \cdots, x_p)$$

giving $m$, the mean value of $y$ for given $x$, as a function of $x$, is called
the regression equation of $y$ on $x_1, \cdots, x_p$. It is customary to write the
above equation as

$$Y = R(x_1, \cdots, x_p)$$


Various uses of the regression are discussed below with suitable
examples.


3f.2 Prediction of Cranial Capacity 

One of the uses of the regression equation is for the prediction of the 
dependent variate for a given set of concomitant variates. For instance, 
a skull may be broken so that the actual cranial capacity cannot be 
determined. In such a case the capacity may be predictable if at least 
some external measurements are available. This requires the construc- 
tion of the regression equation between the cranial capacity and the 
observed set of external measurements on complete skulls. 

The Regression Equation. Three important measurements from which 
the cranial capacity (C) may be predicted are the glabella-occipital 
length (L), the maximum parietal breadth (B), and the basio-bregmatic 
height (H’). Since the magnitude to be estimated is a volume, it is 
appropriate to set up a regression formula of the type 


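$$C = \alpha' L^{\beta_1} B^{\beta_2} H'^{\beta_3}$$

where $\alpha'$, $\beta_1$, $\beta_2$, and $\beta_3$ are the constants to be estimated. By
transforming the variables to

$$y = \log_{10} C \qquad x_1 = \log_{10} L \qquad x_2 = \log_{10} B \qquad x_3 = \log_{10} H'$$

the formula can be written

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

where $\alpha = \log_{10} \alpha'$. From this equation the constants are estimated by
the method of least squares.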
Estimation of the Constants. Using the measurements on the 86 male
skulls from the Farringdon Street series (Hooke, 1926), we find the
mean values

$$\bar{y} = 3.1685 \qquad \bar{x}_1 = 2.2752 \qquad \bar{x}_2 = 2.1523 \qquad \bar{x}_3 = 2.1128$$

The corrected sums of products matrix $(S_{ij})$ for $x_1$, $x_2$, $x_3$ is

    0.01875    0.00848    0.00684
    0.00848    0.02904    0.00878
    0.00684    0.00878    0.02886

The corrected sums of products of $y$ with $x_1$, $x_2$, and $x_3$ are, respectively,

$$Q_1 = 0.03030 \qquad Q_2 = 0.04410 \qquad Q_3 = 0.03629$$


The reciprocal of the matrix $(S_{ij})$, denoted by $(C_{ij})$,* is obtained by
the method of 1d.1.

     64.21    −15.57    −10.49
    −15.57     41.71     −9.00
    −10.49     −9.00     39.88


The estimates of the parameters are

$$\begin{aligned}
b_1 &= 64.21Q_1 - 15.57Q_2 - 10.49Q_3 = 0.878 \\
b_2 &= -15.57Q_1 + 41.71Q_2 - 9.00Q_3 = 1.041 \\
b_3 &= -10.49Q_1 - 9.00Q_2 + 39.88Q_3 = 0.733 \\
a &= \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2 - b_3\bar{x}_3 = -2.618
\end{aligned}$$

The formula for the prediction of cranial capacity † is

$$C = 0.00241\,L^{0.878} B^{1.041} H'^{0.733}$$
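The estimates can be reproduced by solving the normal equations $\sum_j S_{ij} b_j = Q_i$ directly. The sketch below does this by Gaussian elimination rather than through the reciprocal matrix, so the results agree with the text only up to the rounding of the printed $S_{ij}$ and $Q_i$; the helper `solve` and the variable names are assumptions of this illustration.

```python
S = [[0.01875, 0.00848, 0.00684],
     [0.00848, 0.02904, 0.00878],
     [0.00684, 0.00878, 0.02886]]
Q = [0.03030, 0.04410, 0.03629]
y_bar = 3.1685
x_bar = [2.2752, 2.1523, 2.1128]

def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[pivot] = m[pivot], m[k]
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= factor * m[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        tail = sum(m[k][c] * z[c] for c in range(k + 1, n))
        z[k] = (m[k][n] - tail) / m[k][k]
    return z

b = solve(S, Q)                                          # exponents of L, B, H'
a = y_bar - sum(bi * xi for bi, xi in zip(b, x_bar))     # constant on the log scale
alpha_prime = 10 ** a                                    # multiplier in the power formula
```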


Since the estimate of $\beta_i$ is $C_{1i}Q_1 + C_{2i}Q_2 + C_{3i}Q_3$, it follows from
(3a.6.5) and (3a.6.6) that

$$V(b_i) = C_{ii}\sigma^2 \quad \text{and} \quad \operatorname{cov}(b_i, b_j) = C_{ij}\sigma^2$$

where $C_{ij}$ are elements of the above matrix.

Tests of Hypotheses. Having estimated these constants, it is
relevant to examine how far the concomitant variables are helpful in
prediction. If these variables are of no use, then the prediction formula
does not depend on them so that $\beta_1 = \beta_2 = \beta_3 = 0$. This hypothesis
may be tested from the above data.

The residual sum of squares with $(n - 4)$ degrees of freedom is the
minimum value of

$$\sum (y_i - \alpha - \beta_1 x_{1i} - \beta_2 x_{2i} - \beta_3 x_{3i})^2$$

which is

$$\begin{aligned}
\left(\sum y^2 - n\bar{y}^2\right) &- b_1Q_1 - b_2Q_2 - b_3Q_3 \\
&= 0.12692 - 0.878(0.03030) - 1.041(0.04410) - 0.733(0.03629) \\
&= 0.12692 - 0.09911 = 0.02781
\end{aligned}$$


* To be consistent with the matrix notation the reciprocal of $(S_{ij})$ should be written
$(S^{ij})$. In statistical literature this is already known as the C matrix with the elements
$C_{ij}$.

† The capacity of the Farringdon Street series skulls was determined by tight
packing with mustard seed and weighing in the manner described by Macdonell
(1904). The formula is applicable only for predicting capacity determined in this way.


If the hypothesis $\beta_1 = \beta_2 = \beta_3 = 0$ is true, then the minimum value
of $\sum (y_i - \alpha)^2$ is $\sum y^2 - n\bar{y}^2 = 0.12692$, which is the total sum of squares
with $(n - 1)$ degrees of freedom. The reduction in the above sum of
squares is due to regression. The analysis of the sum of squares is
shown in Table 3f.2α.


TABLE 3f.2α. Test of the Hypothesis β1 = β2 = β3 = 0

                D.F.      S.S.        M.S.          F
Regression        3      0.09911     0.033037     97.41
Residual         82      0.02781     0.0003391
Total            85      0.12692


The variance ratio 97.41 with 3 and 82 degrees of freedom is significant
at the 1% level, which shows that the variables considered above are
useful in prediction.
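The partition of the total sum of squares into regression and residual parts can be sketched as follows, taking the rounded estimates printed in the text as inputs. The function name `regression_anova` is hypothetical, and the variance ratio it returns may differ from the printed one in the last digit because of that rounding.

```python
def regression_anova(total_ss, b, q, n):
    """Split the total sum of squares into regression and residual parts."""
    k = len(b)                                    # number of concomitant variables
    ss_reg = sum(bi * qi for bi, qi in zip(b, q))
    ss_res = total_ss - ss_reg
    ms_reg = ss_reg / k
    ms_res = ss_res / (n - k - 1)
    return {"ss_reg": ss_reg, "ss_res": ss_res,
            "df_reg": k, "df_res": n - k - 1,
            "ms_res": ms_res, "F": ms_reg / ms_res}

# Values taken from the text: 86 skulls, rounded estimates b and products Q.
table = regression_anova(0.12692, [0.878, 1.041, 0.733],
                         [0.03030, 0.04410, 0.03629], 86)
```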

It may now be examined whether the three linear dimensions appear
to the same degree in the prediction formula. From the estimates it
is seen that the index $b_2$ for maximum parietal breadth is higher than
the others. This means that a given ratio of increase in breadth counts
more for capacity than the corresponding increase in length or height.

The hypothesis relevant to examine this point is

$$\beta_1 = \beta_2 = \beta_3 = \beta \ \text{(say)}$$

The minimum value of $\sum \{y_i - \alpha - \beta(x_{1i} + x_{2i} + x_{3i})\}^2$ has to be found.
The normal equation giving the estimate of $\beta$ is

$$\begin{aligned}
\{S_{11} + S_{22} + S_{33} + 2(S_{12} + S_{23} + S_{31})\}\,b &= Q = (Q_1 + Q_2 + Q_3) \\
0.12485\,b &= 0.11069 \\
b &= 0.8866
\end{aligned}$$

The minimum value with $(n - 2)$ degrees of freedom is

$$\left(\sum y^2 - n\bar{y}^2\right) - bQ = 0.12692 - 0.09814 = 0.02878$$


TABLE 3f.2β. Test of the Hypothesis β1 = β2 = β3

                            D.F.      S.S.        M.S.          F
Deviation from equality       2      0.00097     0.000485     1.430
Residual                     82      0.02781     0.0003391
Total                        84      0.02878
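The entries of this table follow from a one-parameter fit on the sum $x_1 + x_2 + x_3$. A minimal sketch, using the printed $S_{ij}$, $Q_i$, and the residual 0.02781 from the full regression as given inputs (variable names are assumptions of the illustration):

```python
S = [[0.01875, 0.00848, 0.00684],
     [0.00848, 0.02904, 0.00878],
     [0.00684, 0.00878, 0.02886]]
Q = [0.03030, 0.04410, 0.03629]
total_ss = 0.12692       # corrected sum of squares of y, 85 degrees of freedom
residual_ss = 0.02781    # residual from the three-variable regression, 82 d.f.

# Under beta1 = beta2 = beta3 the single concomitant variable is x1 + x2 + x3.
s_sum = sum(sum(row) for row in S)      # sum of all S_ij
q_sum = sum(Q)
b_common = q_sum / s_sum

ss_common = total_ss - b_common * q_sum          # residual with 84 d.f.
deviation = ss_common - residual_ss              # 2 d.f.
f_ratio = (deviation / 2) / (residual_ss / 82)
```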


The ratio is not significant, so there is no evidence from the data to
conclude that $\beta_1$, $\beta_2$, $\beta_3$ are different. The differences, if any, are likely
to be small, and a large collection of measurements may be necessary
before anything definite can be said about this. Evolutionists believe
that the breadth is increasing relatively more than any other magnitude
on the skull. If this is true it is of interest to examine how far the
cranial capacity is influenced by the breadth.
So far as the problem of prediction is concerned, the formula

$$C = 0.002342\,(LBH')^{0.886}$$

obtained by assuming $\beta_1 = \beta_2 = \beta_3$, may be as useful as the formula
derived without assuming that these are equal. The variance of the
estimate $b$ of $\beta$ is $\sigma'^2 / \sum\sum S_{ij}$ where $\sigma'^2$ is the estimate based on 84 degrees
of freedom with the corresponding sum of squares given in Table
3f.2β.

A simple formula of the type $C = \alpha' LBH'$ is sometimes used for
predicting the cranial capacity. A test of the adequacy of such a formula
is equivalent to testing the hypothesis

$$\beta_1 = \beta_2 = \beta_3 = 1$$

The minimum value of $\sum (y_i - \alpha - \beta_1 x_{1i} - \beta_2 x_{2i} - \beta_3 x_{3i})^2$, assuming
this to be true, is

$$\left(\sum y^2 - n\bar{y}^2\right) + S_{11} + S_{22} + S_{33} + 2(S_{12} + S_{23} + S_{31}) - 2(Q_1 + Q_2 + Q_3) = 0.03039$$

which has $(n - 1)$ degrees of freedom. The residual has $(n - 4)$
degrees of freedom so that the difference with 3 degrees of freedom is
due to deviation from the hypothesis.


TABLE 3f.2γ. Test of the Hypothesis β1 = β2 = β3 = 1

                               D.F.      S.S.        M.S.          F
Deviation from
  β1 = β2 = β3 = 1               3      0.00258     0.00086      2.544
Residual                        82      0.02781     0.0003391
Total                           85      0.03039


The ratio 2.544 with 3 and 82 degrees of freedom is just below the 5%
significance level. It would be of interest to examine this point with
more adequate material.

In Table 3f.2γ the sum of squares due to deviation from the hypothesis
could be directly calculated from the formula, providing a compound
measure of the deviations of the estimates $b_1$, $b_2$, $b_3$ from the expected
values 1, 1, 1.

$$\begin{aligned}
\sum\sum S_{ij}(b_i - 1)(b_j - 1)
&= (b_1 - 1)\{S_{11}(b_1 - 1) + S_{12}(b_2 - 1) + S_{13}(b_3 - 1)\} + \cdots \\
&= (b_1 - 1)(Q_1 - S_{11} - S_{12} - S_{13}) + \cdots \\
&= b_1Q_1 + b_2Q_2 + b_3Q_3 - 2(Q_1 + Q_2 + Q_3) + \sum\sum S_{ij} \\
&= 0.09911 - 2(0.11069) + 0.12485 \\
&= 0.00258
\end{aligned}$$

which is the same as that given in Table 3f.2γ.
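The identity used in this derivation can be checked numerically. The sketch below evaluates both the quadratic form and the shortcut expression with the rounded estimates from the text, so the two results agree with each other and with 0.00258 only to about the fourth decimal; the variable names are assumptions of the illustration.

```python
S = [[0.01875, 0.00848, 0.00684],
     [0.00848, 0.02904, 0.00878],
     [0.00684, 0.00878, 0.02886]]
Q = [0.03030, 0.04410, 0.03629]
b = [0.878, 1.041, 0.733]          # rounded estimates from the text

# Direct quadratic form in the deviations (b_i - 1).
d = [bi - 1.0 for bi in b]
quadratic_form = sum(S[i][j] * d[i] * d[j] for i in range(3) for j in range(3))

# Shortcut: sum b_i Q_i - 2 sum Q_i + sum of all S_ij.
shortcut = (sum(bi * qi for bi, qi in zip(b, Q))
            - 2 * sum(Q)
            + sum(sum(row) for row in S))
```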

Having found evidence that the $\beta$ coefficients differ individually from
unity, it is of some interest to examine whether the indices add up to
3 while distributing unequally among the three dimensions used. This
requires the test of the hypothesis $\beta_1 + \beta_2 + \beta_3 = 3$. The best
estimate of the deviation is $b_1 + b_2 + b_3 - 3 = 2.652 - 3 = -0.348$.

$$V(b_1 + b_2 + b_3 - 3) = \left(\sum\sum C_{ij}\right)\sigma^2 = 75.68\,\sigma^2$$

The ratio with 1 and 82 degrees of freedom is

$$\frac{(0.348)^2}{75.68} \times \frac{1}{0.0003391} = 4.72$$

which is significant at the 5% level. This shows that the number of
dimensions of the prediction formula is not 3.
It is often desirable to test whether the inclusion of an extra variable
increases the accuracy of prediction. For instance, in the above
example we can test whether $H'$ is necessary when $L$ and $B$ have already
been considered. This is equivalent to testing whether $\beta_3 = 0$. The
estimate $b_3 = 0.733$ has the variance $C_{33}\sigma^2$. The ratio with 1 and 82
degrees of freedom is

$$\frac{b_3^2}{C_{33}\sigma^2} = \frac{(0.733)^2}{39.88} \times \frac{1}{0.0003391} = \frac{0.01347}{0.0003391} = 39.72$$

where for $\sigma^2$ the estimate based on 82 degrees of freedom is used.
This is significant at the 1% level, showing that $H'$ is also relevant.
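Both of the last two ratios are instances of the same rule: a linear function $\sum l_i b_i$ has variance $\left(\sum\sum l_i l_j C_{ij}\right)\sigma^2$, and its squared deviation from the hypothetical value divided by the estimated variance is a ratio with 1 and 82 degrees of freedom. A minimal sketch with the printed figures (the function name `linear_function_ratio` is hypothetical):

```python
C = [[64.21, -15.57, -10.49],
     [-15.57, 41.71, -9.00],
     [-10.49, -9.00, 39.88]]
b = [0.878, 1.041, 0.733]
sigma2 = 0.0003391                 # residual mean square, 82 degrees of freedom

def linear_function_ratio(l, hypothetical_value):
    """Variance ratio (1 and 82 d.f.) for testing sum l_i beta_i = hypothetical_value."""
    estimate = sum(li * bi for li, bi in zip(l, b))
    var_factor = sum(l[i] * l[j] * C[i][j] for i in range(3) for j in range(3))
    return (estimate - hypothetical_value) ** 2 / (var_factor * sigma2)

ratio_sum = linear_function_ratio([1, 1, 1], 3.0)    # beta1 + beta2 + beta3 = 3
ratio_b3 = linear_function_ratio([0, 0, 1], 0.0)     # beta3 = 0
```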


If $b_3$ were not significant, the sum of squares due to $b_3$

$$\frac{b_3^2}{C_{33}} = 0.01347$$

could be added to the residual sum of squares 0.02781 to obtain a sum
0.04128 based on 83 degrees of freedom, giving the estimate of $\sigma^2$ =
0.0004973.

If $\beta_3$ is declared to be zero, the best estimates of $\beta_1$ and $\beta_2$ have to be
revised, starting with the equation

$$Y = \alpha + \beta_1 x_1 + \beta_2 x_2$$


It is, however, not necessary to start afresh. The C matrix

     64.21    −15.57    −10.49
    −15.57     41.71     −9.00
    −10.49     −9.00     39.88

is reduced by the method of pivotal condensation, starting from the
last row and using $C_{33}$ as the pivot.

$$\begin{pmatrix}
64.21 - \dfrac{(10.49)^2}{39.88} & -15.57 - \dfrac{(10.49)(9.00)}{39.88} \\[2ex]
-15.57 - \dfrac{(10.49)(9.00)}{39.88} & 41.71 - \dfrac{(9.00)^2}{39.88}
\end{pmatrix}
=
\begin{pmatrix} 61.45 & -17.94 \\ -17.94 & 39.68 \end{pmatrix}$$

which gives the reduced C matrix for the evaluation of $b_1$ and $b_2$.

$$\begin{aligned}
b_1 &= 61.45Q_1 - 17.94Q_2 = 1.071 \\
b_2 &= -17.94Q_1 + 39.68Q_2 = 1.206 \\
a &= \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2 = -1.864
\end{aligned}$$

The residual sum of squares is

$$\sum (y - \bar{y})^2 - b_1Q_1 - b_2Q_2 = 0.04128$$

which agrees with the value obtained by adding the sum of squares due
to $b_3$ to the residual, thus providing a check on the calculation of $b_1$ and
$b_2$. The variance-covariance matrix of $b_1$, $b_2$ is $\sigma^2$ times the new C matrix.
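The condensation step amounts to replacing each retained element $C_{ij}$ by $C_{ij} - C_{ik}C_{kj}/C_{kk}$, where $k$ is the variable being dropped. A minimal sketch with the printed C matrix follows; the function name `drop_variable` is hypothetical.

```python
C = [[64.21, -15.57, -10.49],
     [-15.57, 41.71, -9.00],
     [-10.49, -9.00, 39.88]]
Q = [0.03030, 0.04410, 0.03629]
total_ss = 0.12692

def drop_variable(c, k):
    """C matrix for the retained variables after omitting variable k (zero-based)."""
    keep = [i for i in range(len(c)) if i != k]
    return [[c[i][j] - c[i][k] * c[k][j] / c[k][k] for j in keep] for i in keep]

reduced = drop_variable(C, 2)                       # omit x3, i.e. pivot on C33
b1 = reduced[0][0] * Q[0] + reduced[0][1] * Q[1]
b2 = reduced[1][0] * Q[0] + reduced[1][1] * Q[1]
residual_ss = total_ss - b1 * Q[0] - b2 * Q[1]      # now on 83 degrees of freedom
```

Calling `drop_variable` again on the reduced matrix carries the condensation one stage further when more than one variable is to be omitted.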

If more variables are omitted, the method of pivotal condensation
has to be carried further. The reduced matrix at each stage gives the
C matrix appropriate to the retained variables. It would avoid some
confusion if the C matrix could be written in the order in which the
variables are eliminated before attempting the method of pivotal
condensation. Thus if $x_1$, $x_2$, $x_3$, $x_4$ are the original variables and if $x_2$
and $x_4$ are to be eliminated, we may write the C matrix as

$$\begin{matrix}
C_{22} & C_{24} & C_{21} & C_{23} \\
C_{42} & C_{44} & C_{41} & C_{43} \\
C_{12} & C_{14} & C_{11} & C_{13} \\
C_{32} & C_{34} & C_{31} & C_{33}
\end{matrix}$$

which is obtained by bringing the second and fourth rows and columns
to the first two positions. Now this matrix can be reduced by the
method of forward pivotal condensation.

If the order in which the variables are to be included in the regression
equation is assigned, the successive regression equations can be
obtained by following the computational method of Table 7b.6β in
Chapter 7.


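The Use of the Formula for Predicting the Capacity of a Single Skull.
A skull with $L = 198.5$, $B = 147$, $H' = 131$, i.e., $x_1 = 2.298$, $x_2 =
2.167$, $x_3 = 2.117$, will have the estimated log capacity

$$\begin{aligned}
Y &= \bar{y} + b_1(x_1 - \bar{x}_1) + b_2(x_2 - \bar{x}_2) + b_3(x_3 - \bar{x}_3) \\
&= 3.2069 \\
C &= \text{antilog } 3.2069 = 1610
\end{aligned}$$

$$\begin{aligned}
V(Y) &= \frac{\sigma^2}{n} + \sum\sum (x_i - \bar{x}_i)(x_j - \bar{x}_j)\operatorname{cov}(b_i, b_j) \\
&= \sigma^2\left\{\frac{1}{n} + \sum\sum (x_i - \bar{x}_i)(x_j - \bar{x}_j)C_{ij}\right\} \\
&= \sigma^2(0.04187) = 0.0003391(0.04187) = 0.00001420
\end{aligned}$$

The estimated value of $\sigma^2$ is obtained from the residual line in Table
3f.2α.

$$\begin{aligned}
V(C) &= C^2(\log_e 10)^2\,V(Y) \quad \text{approximately *} \\
&= 195.2
\end{aligned}$$

* A general method of obtaining the variances of transformed variables is given
in 5e.1.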
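The prediction and its variance for a single skull can be sketched with the printed constants. Because the rounded logarithms and the two-decimal C matrix are used, the variance factor and $V(C)$ come out a little different from the printed 0.04187 and 195.2; the variable names are assumptions of the illustration.

```python
import math

C_matrix = [[64.21, -15.57, -10.49],
            [-15.57, 41.71, -9.00],
            [-10.49, -9.00, 39.88]]
b = [0.878, 1.041, 0.733]
y_bar, x_bar = 3.1685, [2.2752, 2.1523, 2.1128]
n, sigma2 = 86, 0.0003391

x = [2.298, 2.167, 2.117]          # log10 of L = 198.5, B = 147, H' = 131, as rounded in the text
d = [xi - mi for xi, mi in zip(x, x_bar)]

log_capacity = y_bar + sum(bi * di for bi, di in zip(b, d))
capacity = 10 ** log_capacity

# Variance of the predicted log capacity: sigma^2 {1/n + sum sum d_i d_j C_ij}.
var_factor = 1.0 / n + sum(d[i] * d[j] * C_matrix[i][j]
                           for i in range(3) for j in range(3))
var_log_capacity = sigma2 * var_factor

# Delta method: dC/dY = C log_e 10 because Y is a logarithm to base 10.
var_capacity = (capacity * math.log(10)) ** 2 * var_log_capacity
```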


It is seen that in the above formula for variance of the estimated
value (Y) the precision of the estimate depends on the closeness of
$x_1$, $x_2$, $x_3$ to the averages $\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$ realized in the sample used to
construct the prediction formula. In fact, the variance is least, equal to
$\sigma^2/n$, for the estimate when the measurements on the specimen coincide
with the average values. The accuracy of prediction diminishes as
$x_1$, $x_2$, $x_3$ depart more and more from the average values, and the
prediction may become completely unreliable if $x_1$, $x_2$, $x_3$ fall outside the
range of values observed in the sample.

For prediction with a single variable the regression equation is

$$Y = a + b(x - \bar{x})$$

and

$$V(Y) = \sigma^2\left[\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{11}}\right]$$

where $S_{11}$ is the corrected sum of squares for $x$ in the observed sample.
As before, accuracy is higher for prediction near the mean. The formula 
also depends on S11, the scatter of x in the sample; the larger the scatter, 
the higher is the precision of the estimate for any x. Therefore, in 
choosing the sample for the construction of the prediction formula, the 
values of x at the extremities of its range should be observed if a best 
prediction formula is to be constructed. This is, no doubt, a theoreti- 
cally sound policy which can be carried out with advantage when it 
is known for certain that the regression equation is of the linear form. 
In fact, data collected in such a manner are not suitable for judging 
whether the regression is linear or not, and there is no reason to believe 
that linearity of regression is universal. The biometric experience is 
that the regressions are very nearly linear, deviations from linearity 
being detectable only in large samples. If this is so, the regression 
line fitted to the data is only an approximation to the true regression 
function, and the data should allow a closest possible fit of the straight 
line to the ideal curve. The best plan for this is to choose x from all
over its range or, preferably, to choose x at random so that different 
values of x may occur with their own probability and exert their in- 
fluence in the determination of the straight-line fit. 
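The dependence of the precision on the distance from the mean and on the scatter $S_{11}$ can be seen from a small sketch of the single-variable variance formula. The function name `prediction_variance` and the numerical inputs below are purely illustrative assumptions, not values from the text.

```python
def prediction_variance(x, x_bar, n, s11, sigma2):
    """Variance of the predicted mean of y at x for a single concomitant variable."""
    return sigma2 * (1.0 / n + (x - x_bar) ** 2 / s11)

# Hypothetical inputs chosen only to show the behaviour of the formula.
n, x_bar, sigma2 = 50, 10.0, 4.0
at_mean = prediction_variance(10.0, x_bar, n, 25.0, sigma2)
away_from_mean = prediction_variance(12.0, x_bar, n, 25.0, sigma2)
with_larger_scatter = prediction_variance(12.0, x_bar, n, 100.0, sigma2)
```

With these inputs the variance is smallest at the mean, grows with the distance from it, and shrinks again when the sample scatter is larger.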

The Use of the Formula in Estimating the Mean Capacity. The for- 
mula can also be used to estimate the mean cranial capacity of a series 
of skulls. For this purpose two methods are available. We may esti- 
mate the cranial capacity of individual skulls and calculate the mean 
of these estimates, or we may apply the formula directly to the mean 
values of L, B, H’ for the series. It is of interest to know whether 
these two methods give the same results. For this purpose estimates
were made of the mean cranial capacity of an additional 29 male skulls
of the Farringdon Street series for which measurements of L, B, H’
but not of C were available.

For these 29 skulls the mean of L is 191.1, of B is 143.1, and of H’ 
is 129.0. 

Applying the formula $C = 0.00241\,L^{0.878} B^{1.041} H'^{0.733}$ to these mean
values, we estimate the mean of C to be 1498.3. If we estimate C
for the 29 skulls individually and take the mean of the 29 estimates,
we get an estimate of the mean value of C equal to 1498.2.
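Applying the formula to a set of mean measurements is a one-line computation. The sketch below uses the rounded constants of the printed formula, so its results differ from the figures in the text by a unit or two; the function name `predict_capacity` is hypothetical.

```python
def predict_capacity(length, breadth, height):
    """Cranial capacity from L, B, H' using the rounded constants of the printed formula."""
    return 0.00241 * length ** 0.878 * breadth ** 1.041 * height ** 0.733

# Mean L, B, H' of the 29 Farringdon Street skulls without measured capacity.
farringdon_29 = predict_capacity(191.1, 143.1, 129.0)

# Mean L, B, H' of the 22 Moorfields skulls with all four measurements.
moorfields_22 = predict_capacity(189.5, 142.5, 128.8)
```

Comparing the two methods described in the text would need the individual measurements, which are not reproduced here: one would average `predict_capacity` over the skulls and compare with the value at the means.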

The same estimates were calculated for the 22 male skulls of the 
Moorfields series (Hooke, 1926) for which all four measurements were 
available. For these 22 skulls the mean of L is 189.5, of B is 142.5, and 
of H’ is 128.8, giving an estimate of the mean of C equal to 1479.0. If 
we estimate C for the 22 skulls individually and calculate the mean, 
we get an estimate of the mean of C equal to 1480.0. Thus it appears 
that the two methods give very nearly the same estimates. 

Are Only Small Skulls Preserved? A point of some interest is that, 
whereas the mean value of C for the Farringdon Street series as calcu- 
lated from 86 measured values is 1481.3, the mean value of C as esti- 
mated by our formula from the 29 skulls for which measurements of 
L, B, H’ but not of C are available is 1498.3. 

Again, for the Moorfields series the mean of L is 189.2 based on 44 
measurements, of B is 143.0 based on 46 measurements, and of H’ is 
129.8 based on 34 measurements. Applying our formula to these mean 
values (as we may do with some confidence as shown above), we obtain 
an estimate of the mean of C equal to 1490.7. The mean of C as calcu- 
lated from 22 measured values is only 1473.8. 

The above results suggest that those skulls which are damaged to 
such an extent that the cranial capacity cannot be measured are on the 
whole larger than those that remain intact. 

This raises a serious issue. Are not the published mean values of
cranial capacities gross underestimates? Can a suitable method be
suggested to correct these values? One way would be to use the samples
providing observations on C, L, B, and H’ for merely constructing the
prediction formula. As observed earlier, the prediction formula,
provided the nature of the regression function used is appropriate, could
be obtained from samples providing observations on all the
measurements although the samples are not drawn at random from the
population. For instance, if only small skulls are preserved, the
measurements obtained are not strictly random from the population of skulls.
Such material is being used just to establish a relationship. Having
obtained this formula, the mean values of L, B, H’ obtained from all
the available measurements may be substituted to obtain an estimate
of the mean capacity. This value will be higher than the average of 
the available measurements of the cranial capacity but less than the 
predicted value on the basis of mean L, B, H’ from skulls providing 
these measurements only. 

The extent of underestimation depends on the proportion of the dis- 
integration of the large skulls. This may vary from series to series, 
and hence for a proper comparison of the mean capacities the correction 
indicated above may have to be applied. 


3f.3 Test for the Equality of Regression Equations 

It is very often necessary to test whether regression functions
constructed from two series are the same. Thus if the formula for the
prediction of the cranial capacity from a different series is

$$y = a' + b_1'x_1 + b_2'x_2 + b_3'x_3$$

two types of hypotheses may be tested, namely whether, in the expectation,

(i) $a = a'$, $b_1 = b_1'$, $b_2 = b_2'$, $b_3 = b_3'$

(ii) $b_1 = b_1'$, $b_2 = b_2'$, $b_3 = b_3'$ irrespective of whether $a$
equals $a'$ or not


If the former is true, then the whole regression function is the same in 
both series; if the latter is true, the regression functions are the same 
apart from a change in the constant. These two hypotheses are relevant 
because many problems arise where a prediction formula constructed 
from one series may have to be used for a specimen from an entirely 
different series. An extreme and rather ambitious case of such a use 
is the prediction of stature of prehistoric men from the length of fossil 
femur by using a formula connecting the stature of modern man with 
the length of his long bones (K. Pearson, 1898). Some sort of justi- 
fication for such a procedure will be available if the first hypothesis is 
proved to be correct in analogous situations. We first deal with the 
test procedures when the prediction formulae are available for both the 
series. 
Let the derived quantities for the second series be:

Sample size: $n'$
Mean values: $\bar{x}_1'$, $\bar{x}_2'$, $\bar{x}_3'$ and $\bar{y}'$
Corrected sums of products:

$$\sum (x_{ir}' - \bar{x}_i')(x_{jr}' - \bar{x}_j') = S_{ij}'$$

$$\sum (x_{ir}' - \bar{x}_i')(y_r' - \bar{y}') = Q_i'$$

These are sufficient to determine the regression function

$$y = a' + b_1'x_1 + b_2'x_2 + b_3'x_3$$

The residual sum of squares

$$\begin{aligned}
R_0^2 = {} & \sum (y_r' - \bar{y}')^2 - b_1'Q_1' - b_2'Q_2' - b_3'Q_3' \quad \text{(for the second sample)} \\
& + \sum (y_r - \bar{y})^2 - b_1Q_1 - b_2Q_2 - b_3Q_3 \quad \text{(for the first sample)}
\end{aligned}$$

has $(n' - 4) + (n - 4) = (n + n' - 8)$ degrees of freedom.


We now throw the two samples together and consider them as a
single sample of size $(n + n')$ and determine the regression line and
residual sum of squares by using the above formula. The necessary
quantities can be computed from those already available:

Sample size: $n + n'$

Mean values:

$$\bar{x}_i'' = \frac{n\bar{x}_i + n'\bar{x}_i'}{n + n'}$$

Corrected sums of products:

$$S_{ij}'' = S_{ij} + S_{ij}' + \frac{nn'}{n + n'}(\bar{x}_i - \bar{x}_i')(\bar{x}_j - \bar{x}_j')$$

$$Q_i'' = Q_i + Q_i' + \frac{nn'}{n + n'}(\bar{x}_i - \bar{x}_i')(\bar{y} - \bar{y}')$$

If $b_1''$, $b_2''$, $\cdots$ are the regression coefficients, then the residual sum
of squares $R_1^2$ with $(n + n' - 4)$ degrees of freedom is

$$R_1^2 = \sum (y_r - \bar{y})^2 + \sum (y_r' - \bar{y}')^2 + \frac{nn'}{n + n'}(\bar{y} - \bar{y}')^2 - b_1''Q_1'' - b_2''Q_2'' - b_3''Q_3''$$

We can set up the analysis of variance table.

We can set up the analysis of variance table. 


TABLE 3f.3α. Analysis of Variance for Testing Equality of Regression
Coefficients

Residual Due to               D.F.             S.S.

Deviation from hypothesis     4                $R_1^2 - R_0^2$ *
Separate regressions          $n + n' - 8$     $R_0^2$

Common regression             $n + n' - 4$     $R_1^2$

* Obtained by subtraction.




The significance of the ratio of mean square due to deviation from 
hypothesis to residual due to separate regressions is tested. 
If the object is to test for the equality of the b coefficients only, we
calculate the quantities

$$Q_i''' = Q_i + Q_i' \qquad S_{ij}''' = S_{ij} + S_{ij}'$$

and obtain the constants $b_1'''$, $b_2'''$, $b_3'''$ from the equations

$$Q_i''' = b_1'''S_{1i}''' + b_2'''S_{2i}''' + b_3'''S_{3i}''' \qquad i = 1, 2, 3$$

and find the residual sum of squares $R_2^2$ with $(n + n' - 5)$ degrees of
freedom.

$$R_2^2 = \sum (y_r - \bar{y})^2 + \sum (y_r' - \bar{y}')^2 - b_1'''Q_1''' - b_2'''Q_2''' - b_3'''Q_3'''$$

The test depends on the variance ratio

$$\frac{R_2^2 - R_0^2}{3} \div \frac{R_0^2}{n + n' - 8}$$

with 3 and $(n + n' - 8)$ degrees of freedom.

In biological data it is often found that the mutual correlations and
variabilities of measurements are approximately the same for all allied
series, in which case the coefficients $b_1$, $b_2$, $b_3$ in the regression formula
will not differ much. On the other hand the mean values differ to some
extent from series to series, in which case the equality of the constant
term means that the expected value of

$$y - \beta_1x_1 - \beta_2x_2 - \beta_3x_3$$

is the same for both the series. This leads us to consider a different
problem: whether $\alpha = \alpha'$ when $\beta_1 = \beta_1'$, $\beta_2 = \beta_2'$, $\ldots$. A test for this
can be immediately obtained from the sums of squares calculated above.
The suitable statistic is the variance ratio

$$\frac{R_1^2 - R_2^2}{1} \div \frac{R_2^2}{n + n' - 5}$$

with 1 and $(n + n' - 5)$ degrees of freedom. If the above hypothesis
is true, then the difference in the mean value of y could be completely
explained by differences in the other variables $x_1$, $x_2$, $x_3$. This problem
is considered more fully in Chapter 7 (7b.6). It appears that when a
sufficient number of measurements is considered the extra difference 
contributed by any other measurement independent of the set already 
considered is negligibly small. In such situations the equality of the 
dispersion matrix in both series is sufficient to ensure the equality of 
the regression functions as a whole. A good deal of caution is necessary 




when the prediction formula based on one or two variables is so used. 
Such a statistical adventure undertaken by Karl Pearson in predicting 
the stature of prehistoric men is, however, justifiable if we agree with 
the last statement of his article (K. Pearson, 1898). 

“No scientific investigation is final; it merely represents the most 
probable conclusion which can be drawn from the data at the disposal 
of the writer. A wider range of facts, or more refined analysis, experi- 
ment, and observation will lead to new formulae and new theories. 
This is the essence of scientific progress.” 


3f.4 The Test for an Assigned Regression Function 

In 3f.2 it was assumed that the regression function for log capacity 
is linear in the logarithms of length, breadth, and height. If, at least, 
for some given sets of values of the independent variables multiple 
observations on the dependent variable have been observed, the validity 
of such an assumption can be tested. 

In Table 3f.4α are given the mean values of nasal index of people
living in various parts of India together with the mean annual temperature
and relative humidity of the places.


TABLE 3f.4α. Nasal Index of the Inhabitants and the Temperature
and Humidity of the Region

                   Sample    Nasal Index           Temperature           Relative Humidity
Region             Size      Mean      Total       Mean       Total      Mean       Total
                                                   Annual                Annual
                   (n)       (ȳ)       (nȳ)        (t)        (nt)       (h)        (nh)

Assam              36        83.0      2,988.0     72.6       2,613.6    85         3,060
Orissa             40        80.4      3,216.0     80.3       3,212.0    69         2,760
Bihar              30        80.1      2,403.0     74.8       2,244.0    88         2,640
Malabar            45        77.0      3,465.0     80.2       3,609.0    81         3,645
Bombay             26        76.2      1,981.2     77.6       2,017.6    66         1,716
Madras             35        75.9      2,656.5     81.8       2,863.0    76         2,660
Punjab             28        71.4      1,999.2     76.4       2,139.2    63         1,764
United Province    32        80.8      2,585.6     77.2       2,470.4    69         2,208
Andhra             41        76.8      3,148.8     80.3       3,292.3    69         2,829
Ceylon             31        80.3      2,490.0     80.2       2,486.2    82         2,542

Total              344                 26,933.3               26,947.3              25,824

Mean                         78.294                78.335                75.070




The corrected total sum of squares for nasal index has been found to be 
11,140.209 with (344 — 1) degrees of freedom. 

If no assumption is made about the regression of nasal index on
temperature and relative humidity, the analysis of variance between
and within groups is obtained as in Table 3f.4β.


TABLE 3f.4β. Analysis of Variance for Nasal Index

                   D.F.    S.S.                                                                M.S.
Between groups     9       $\sum n_i\bar{y}_i^2 - \bar{y}\sum n_i\bar{y}_i$ = 3,169.900        352.21
Within groups      334     * 7,970.309                                                         23.863
Total              343     11,140.209

* Obtained by subtraction.


The mean square for between groups is very large, indicating real
differences in nasal index. Can these differences be explained by a
regression of nasal index (y) on temperature (t) and relative humidity
(h) of the form

$$Y = \alpha + \beta_1t + \beta_2h \;?$$

The normal equations leading to the estimates $b_1$, $b_2$ of $\beta_1$, $\beta_2$ are

$$Q_1 = b_1S_{11} + b_2S_{12}$$
$$Q_2 = b_1S_{12} + b_2S_{22}$$

where

$$Q_1 = \sum (n\bar{y}t) - \frac{(\sum nt)(\sum n\bar{y})}{\sum n} = -1{,}072.40$$

$$Q_2 = \sum (n\bar{y}h) - \frac{(\sum nh)(\sum n\bar{y})}{\sum n} = 4{,}586.57$$

$$S_{12} = \sum (nth) - \frac{(\sum nt)(\sum nh)}{\sum n} = -2{,}334.91$$

$$S_{11} = \sum (nt^2) - \frac{(\sum nt)^2}{\sum n} = 2{,}721.06$$

$$S_{22} = \sum (nh^2) - \frac{(\sum nh)^2}{\sum n} = 22{,}042.32$$

Solving the above equations, $b_1$ and $b_2$ are obtained as

$$b_1 = -0.237113 \qquad b_2 = 0.182963$$
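The solution of the two normal equations can be reproduced with a few lines of arithmetic. This is a minimal sketch using Cramer's rule; the function name `solve_2x2` is introduced here, and $Q_1$ and $S_{12}$ are entered as negative on the assumption that this is the sign pattern consistent with the printed coefficients and with the sum of squares due to regression.

```python
def solve_2x2(s11, s12, s22, q1, q2):
    """Solve q1 = b1*s11 + b2*s12 and q2 = b1*s12 + b2*s22 by Cramer's rule."""
    det = s11 * s22 - s12 * s12
    b1 = (q1 * s22 - q2 * s12) / det
    b2 = (s11 * q2 - s12 * q1) / det
    return b1, b2

# Sums of products for the nasal index example (signs of Q1 and S12 assumed negative).
Q1, Q2 = -1072.40, 4586.57
S11, S12, S22 = 2721.06, -2334.91, 22042.32

b1, b2 = solve_2x2(S11, S12, S22, Q1, Q2)

# Sum of squares due to regression, b1*Q1 + b2*Q2.
ss_regression = b1 * Q1 + b2 * Q2
```

With these inputs `b1` and `b2` agree with the printed values to about six decimals, and `ss_regression` is close to 1,093.45.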




With these values the regression analysis can be set up as in Table
3f.4γ.

TABLE 3f.4γ. Regression Analysis

                              D.F.    S.S.                               M.S.
Due to regression             2       $b_1Q_1 + b_2Q_2$ = 1,093.45       546.72
Residual about regression     341     * 10,046.759                       29.463

Total                         343     11,140.209

* Obtained by subtraction.


If the hypothesis concerning the regression is true, then the mean
squares obtained from “Within groups” of Table 3f.4β and “Residual
about regression” of Table 3f.4γ will be of the same magnitude. A
significant difference would disprove the hypothesis.


TABLE 3f.4δ. Test for the Specified Regression Function

                              D.F.    S.S.            M.S.       F
Deviation from specified
  regression                  7       2,076.450 *     296.636    12.4
Within groups                 334     7,970.309       23.863

Residual about regression     341     10,046.759

* Obtained by subtraction.


The ratio 12.4 with 7 and 334 degrees of freedom is significant at the 
1% level, so that the regression of nasal index on temperature and 
relative humidity cannot be considered linear. It is also seen from 
Table 3f.4γ that the variance ratio 18.56 with 2 and 341 degrees of
freedom is significant, but this does not mean that nasal index depends 
entirely on weather conditions of the place in which the individuals 
live. The observed differences may be more complex than can be 
explained by weather differences, or the nature of dependence on 
weather may itself be very complicated. Unless such a relationship
is discovered and found to fit well, some caution is necessary before
concluding that the shape of the nose is determined by temperature,
humidity, etc.
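The test for the specified regression function is a matter of subtracting sums of squares and degrees of freedom. The sketch below repeats that arithmetic with the figures quoted in the tables of this section; the helper `variance_ratio` is a name introduced here for illustration.

```python
def variance_ratio(ss_num, df_num, ss_den, df_den):
    """Ratio of two mean squares."""
    return (ss_num / df_num) / (ss_den / df_den)

# Figures quoted in the analysis of variance and regression tables.
total_ss, total_df = 11140.209, 343
within_ss, within_df = 7970.309, 334
regression_ss, regression_df = 1093.45, 2

# Residual about the fitted regression, obtained by subtraction from the total.
residual_ss = total_ss - regression_ss
residual_df = total_df - regression_df

# Deviation from the specified regression: residual minus within groups.
deviation_ss = residual_ss - within_ss
deviation_df = residual_df - within_df

f_deviation = variance_ratio(deviation_ss, deviation_df, within_ss, within_df)
```

Here `deviation_df` is 7 and `f_deviation` is the ratio that is referred to the variance ratio tables with 7 and 334 degrees of freedom.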

In some cases, as in the distribution of heights of father and daughter, 
it may be desired to test whether the regression of one variable on the 
other is linear. For the purposes of such a test the range of the inde- 
pendent variable has to be divided into a suitable number of class 
intervals and the variance of the dependent variable analyzed into 




between and within classes. To get an estimate of within variation 
it is necessary that at least some of the classes contain more than one 
observation. The regression analysis can be done without the use of 
the class intervals, or if the data are already grouped the midpoint of 
the class interval is taken as the value of the independent variable for 
each observation of the dependent variable in that class. The final 
test can be carried out as in Table 3f.4δ.


3g The General Problem of Least Squares 
with Two Sets of Parameters 


3g.1 Concomitant Variables 


Suppose that the growth rates of groups of animals receiving dif- 
ferent diets are to be compared. The observed differences in growth 
rates can be attributed to diet only if all the animals treated are similar 
in some observable aspects such as age, initial weight, parentage, etc., 
which influence the growth rate. In fact, if the groups of animals 
receiving different diet differ in these aspects, it is desirable to compare 
the growth rates after eliminating these differences. 

However, it may be noted that no bias is introduced in the experiment 
if the animals which might differ in these aspects are assigned at random 
to the groups to be treated differently. This procedure enhances the 
residual variation calculable from the differences in the growth rates 
of animals receiving the same treatment and thus decreases the effi- 
ciency of the experiment. 

If the magnitudes of these additional variables are known, it is pos- 
sible to eliminate the differences caused by them independently of the 
treatments both from within and between groups and to test for the 
pure effects of the treatments with greater efficiency. The compu- 
tational technique relating to this process is known as the adjustment 
for concomitant variation. 

For significant reduction in the residual variation it must be known 
that the effects under study are influenced by the concomitant variables. 
This is important in experimental studies where due consideration is 
to be given to the cost and time involved in recording the concomitant 
variables. This can be tested as shown in the example considered in 
3g.3. 

On the other hand, the concomitant variables chosen must not have 
been influenced by the treatments under consideration. Sometimes, in 
assessing the differences in yields of plants treated differently, con- 
comitant variables such as the number of branches or the quantity of 
straw are chosen. These will be valid only when variations in them
produce corresponding variation in the yield of plants treated alike.




3g.2 Adjustment for Concomitant Variation 
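Assuming the regression of y on the concomitant variables $x_1, \ldots, x_k$
to be linear, the observational equations containing the parameters $\tau_1$,
$\ldots$, $\tau_m$ under consideration and the regression coefficients $\beta_1, \ldots, \beta_k$
can be written*

$$E(y_i) = a_{i1}\tau_1 + \cdots + a_{im}\tau_m + \beta_1x_{1i} + \cdots + \beta_kx_{ki} \qquad \text{(3g.2.1)}$$

The normal equations giving the best estimates of parametric functions
are

$$Q_j = (a_j \cdot a_1)t_1 + \cdots + (a_j \cdot a_m)t_m + (a_j \cdot x_1)b_1 + \cdots + (a_j \cdot x_k)b_k \qquad j = 1, \ldots, m \qquad \text{(3g.2.2)}$$

and

$$P_{0s} = (a_1 \cdot x_s)t_1 + \cdots + (a_m \cdot x_s)t_m + (x_1 \cdot x_s)b_1 + \cdots + (x_k \cdot x_s)b_k \qquad s = 1, \ldots, k \qquad \text{(3g.2.3)}$$

where $Q_j = (a_j \cdot y)$ and $P_{0s} = (x_s \cdot y)$. To solve them let us construct
the equations

$$Q_j = (a_j \cdot a_1)t_1^{(0)} + \cdots + (a_j \cdot a_m)t_m^{(0)} \qquad j = 1, \ldots, m$$

and

$$Q_j^{(s)} = (a_j \cdot a_1)t_1^{(s)} + \cdots + (a_j \cdot a_m)t_m^{(s)} \qquad j = 1, \ldots, m; \quad s = 1, \ldots, k$$

where $Q_j^{(s)}$ is the same function as $Q_j$ with the variable y replaced
by the sth concomitant variable $x_s$. Multiplying the equations in
(3g.2.2) by $t_1^{(1)}, t_2^{(1)}, \ldots, t_m^{(1)}$ and subtracting their total from the
first equation in (3g.2.3), we obtain

$$\begin{aligned}
E_{01} &= P_{01} - t_1^{(1)}Q_1 - \cdots - t_m^{(1)}Q_m \\
&= b_1\bigl(P_{11} - t_1^{(1)}Q_1^{(1)} - \cdots - t_m^{(1)}Q_m^{(1)}\bigr) + b_2\bigl(P_{12} - t_1^{(1)}Q_1^{(2)} - \cdots - t_m^{(1)}Q_m^{(2)}\bigr) + \cdots \\
&= b_1E_{11} + b_2E_{12} + \cdots + b_kE_{1k}
\end{aligned}$$

Similarly, the equations

$$E_{0j} = b_1E_{j1} + b_2E_{j2} + \cdots + b_kE_{jk} \qquad j = 2, \ldots, k \qquad \text{(3g.2.4)}$$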

* The regression can be of the type $\beta_1f_1 + \beta_2f_2 + \cdots + \beta_kf_k$, where $f_1, \ldots, f_k$ are
functions of the concomitant variables, in which case $f_1, \ldots, f_k$ will be treated as
separate variables. Thus, if the regression is polynomial in one variable, $\beta_1x + \beta_2x^2$
$+ \cdots$, the functions $x, x^2, \ldots$ are considered as separate variables.


are obtained. In the above equations

$$E_{ij} = P_{ij} - t_1^{(i)}Q_1^{(j)} - \cdots - t_m^{(i)}Q_m^{(j)} = P_{ij} - t_1^{(j)}Q_1^{(i)} - \cdots - t_m^{(j)}Q_m^{(i)} \qquad i, j = 0, 1, \ldots, k$$

and

$$P_{ij} = x_i \cdot x_j$$

with the suffix 0 referring to y, so that $x_0 = y$ and $Q_j^{(0)} = Q_j$.
These are the residual sums of products, the residuals being obtained
from the observational equations for any two variables with the same
matrix of equations but different sets of parameters. Having obtained
the values of $b_1, \ldots, b_k$ satisfying equations (3g.2.4), the solution for
$t_i$ is given by

$$t_i = t_i^{(0)} - b_1t_i^{(1)} - \cdots - b_kt_i^{(k)}$$

This completes the estimation of the parametric functions. The residual
sum of squares for the observational equations (3g.2.1) is

$$\sum y^2 - \sum t_jQ_j - \sum b_sP_{0s} = \Bigl(\sum y^2 - \sum t_j^{(0)}Q_j\Bigr) - \sum b_sE_{0s} = E_{00} - \sum b_sE_{0s}$$


which is a function of the residual sum of squares and products. 
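For the simplest design, a one-way classification with a single concomitant variable, the residual sums of products are the within-group sums of products and the adjustment reduces to one equation. The sketch below works through that case on hypothetical data (not from the text); the helper `within_sp` and the variable names are introduced here for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def within_sp(groups, i, j):
    """Residual (within-group) sum of products of columns i and j."""
    total = 0.0
    for rows in groups:
        mi = mean([r[i] for r in rows])
        mj = mean([r[j] for r in rows])
        total += sum((r[i] - mi) * (r[j] - mj) for r in rows)
    return total

# Hypothetical data: each row is (y, x); one list per treatment group.
groups = [
    [(9.1, 40.0), (9.8, 46.0), (8.7, 37.0)],
    [(8.2, 35.0), (9.0, 42.0), (9.9, 48.0), (8.8, 41.0)],
    [(10.1, 44.0), (9.2, 38.0), (9.6, 43.0)],
]

E00 = within_sp(groups, 0, 0)
E01 = within_sp(groups, 0, 1)
E11 = within_sp(groups, 1, 1)

b = E01 / E11                   # the single equation E01 = b * E11
residual_ss = E00 - b * E01     # pure residual sum of squares

# t_i = t_i(0) - b * t_i(1): here the group mean of y less b times the group mean of x.
adjusted = [mean([r[0] for r in g]) - b * mean([r[1] for r in g]) for g in groups]

# Direct check: squared residuals of the fitted model y = t_i + b * x.
direct = sum((y - (t + b * x)) ** 2
             for g, t in zip(groups, adjusted) for y, x in g)
```

The agreement of `residual_ss` and `direct` illustrates that the residual sum of squares depends on the data only through the residual sums of squares and products.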
There are two types of hypotheses to be tested in the above problem. 
Do the concomitant variables increase the efficiency of comparisons? 


The hypothesis to be tested is $\beta_1 = \cdots = \beta_k = 0$. If this is true the
residual sum of squares is $E_{00}$, which differs from the pure residual
$E_{00} - b_1E_{01} - \cdots - b_kE_{0k}$ by $b_1E_{01} + b_2E_{02} + \cdots + b_kE_{0k}$, which has k degrees of
freedom. The mean square of this is compared with the mean square
for the pure residual.

Is there any additional advantage in considering $x_k$ in conjunction with
$x_1, \ldots, x_{k-1}$? The hypothesis to be tested is $\beta_k = 0$. Omitting $b_k$
in the equations (3g.2.4), let the solutions be $b_1', \ldots, b_{k-1}'$ so that the
residual sum of squares is

$$E_{00} - b_1'E_{01} - \cdots - b_{k-1}'E_{0,k-1}$$

The pure residual is $E_{00} - b_1E_{01} - \cdots - b_kE_{0k}$. Their difference with
1 degree of freedom supplies the valid sum of squares for testing the
above hypothesis. Similarly, we can test whether two or more variables
are useful in conjunction with others.

If the hypothesis to be tested is specified by some linear restrictions
on $\tau_1, \tau_2, \ldots$, then the residual sum of squares has to be obtained
subject to these restrictions. If $E_{ij}'$ represents the residual sum of
products under these restrictions, then the residual sum of squares is

$$E_{00}' - b_1''E_{01}' - \cdots - b_k''E_{0k}'$$

where $b_1'', \ldots, b_k''$ are the solutions of

$$\begin{aligned}
E_{01}' &= b_1''E_{11}' + \cdots + b_k''E_{1k}' \\
&\;\;\vdots \\
E_{0k}' &= b_1''E_{k1}' + \cdots + b_k''E_{kk}'
\end{aligned}$$

The sum of squares for testing the above hypothesis is

$$E_{00}' - b_1''E_{01}' - \cdots - b_k''E_{0k}' - \text{pure residual sum of squares}$$


The degrees of freedom will be equal to the degrees of freedom of the 
hypothesis. The mean square corresponding to this can be tested 
against the pure residual. This completes the formal theory. The 
method is further explained in the illustration considered in 3g.3. 


3g.3 An Illustrative Example 
The following data relate to the initial weights and the growth rates 
of 30 pigs classified according to pen, sex, and type of food given. 


TABLE 3g.3α. Data for Analysis (Wishart, 1938)

                              Initial    Growth Rate in
                              Weight     Pounds per Week
Pen     Treatment    Sex      (w)        (g)

I       A            G        48         9.94
I       B            G        48         10.00
I       C            G        48         9.75
I       C            H        48         9.11
I       B            H        39         8.51
I       A            H        38         9.52

II      B            G        32         9.24
II      C            G        28         8.66
II      A            G        32         9.48
II      C            H        37         8.50
II      A            H        35         8.21
II      B            H        38         9.95

III     C            G        33         7.63
III     A            G        35         9.32
III     B            G        41         9.34
III     B            H        46         8.43
III     C            H        42         8.90
III     A            H        41         9.32

IV      C            G        50         10.37
IV      A            H        48         10.56
IV      B            G        46         9.68
IV      A            G        46         10.98
IV      B            H        40         8.86
IV      C            H        42         9.51

V       B            G        37         9.67
V       A            G        32         8.82
V       C            G        30         8.57
V       B            H        40         9.20
V       C            H        40         8.76
V       A            H        43         10.42


The problem is to study the effect of food after eliminating the initial 
weight. 

The first step in the analysis is to analyze the sum of squares of both 
the dependent and independent variables and also the sum of products. 
The analysis of the sum of products is done in the same manner as the 
analysis of the sum of squares by adopting the rule that the square of 


a variable in the latter is replaced by the product of the variables in- 
volved in the former. 


The total sums of squares and products, with 29 D.F., are

$$\sum g^2 - \bar{g}\sum g = 16.6068$$
$$\sum wg - \bar{w}\sum g = 78.979$$
$$\sum w^2 - \bar{w}\sum w = 1108.70$$


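If $w_i$ and $g_i$ denote the totals of 6 observations for the ith pen, then the
sums of squares and products for pens, with 4 D.F., are

$$\tfrac{1}{6}\sum g_i^2 - \bar{g}\sum g = 4.9607$$
$$\tfrac{1}{6}\sum g_iw_i - \bar{g}\sum w = 40.324$$
$$\tfrac{1}{6}\sum w_i^2 - \bar{w}\sum w = 605.87$$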
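The rule that a sum of products is analyzed exactly like a sum of squares can be seen by running one routine on both. The sketch below is an illustration only: the records are transcribed from the data table for the 30 pigs, and the helper `between_sp` is a name introduced here. Its results for pens agree with the Pen row of the analysis of variance and covariance table (4.9607, 40.324, 605.87).

```python
# (pen, food, sex, initial weight w, growth rate g), as read from the data table.
PIGS = [
    (1, "A", "G", 48, 9.94), (1, "B", "G", 48, 10.00), (1, "C", "G", 48, 9.75),
    (1, "C", "H", 48, 9.11), (1, "B", "H", 39, 8.51), (1, "A", "H", 38, 9.52),
    (2, "B", "G", 32, 9.24), (2, "C", "G", 28, 8.66), (2, "A", "G", 32, 9.48),
    (2, "C", "H", 37, 8.50), (2, "A", "H", 35, 8.21), (2, "B", "H", 38, 9.95),
    (3, "C", "G", 33, 7.63), (3, "A", "G", 35, 9.32), (3, "B", "G", 41, 9.34),
    (3, "B", "H", 46, 8.43), (3, "C", "H", 42, 8.90), (3, "A", "H", 41, 9.32),
    (4, "C", "G", 50, 10.37), (4, "A", "H", 48, 10.56), (4, "B", "G", 46, 9.68),
    (4, "A", "G", 46, 10.98), (4, "B", "H", 40, 8.86), (4, "C", "H", 42, 9.51),
    (5, "B", "G", 37, 9.67), (5, "A", "G", 32, 8.82), (5, "C", "G", 30, 8.57),
    (5, "B", "H", 40, 9.20), (5, "C", "H", 40, 8.76), (5, "A", "H", 43, 10.42),
]
PEN, FOOD, W, G = 0, 1, 3, 4  # column positions in each record

def between_sp(rows, key, u, v):
    """Between-class sum of products of columns u and v for the classes in column key."""
    totals = {}
    for r in rows:
        tu, tv, n = totals.get(r[key], (0.0, 0.0, 0))
        totals[r[key]] = (tu + r[u], tv + r[v], n + 1)
    grand_u = sum(r[u] for r in rows)
    grand_v = sum(r[v] for r in rows)
    return (sum(tu * tv / n for tu, tv, n in totals.values())
            - grand_u * grand_v / len(rows))

# Squares use the same column twice; products use two different columns.
pen_gg = between_sp(PIGS, PEN, G, G)
pen_wg = between_sp(PIGS, PEN, W, G)
pen_ww = between_sp(PIGS, PEN, W, W)
food_gg = between_sp(PIGS, FOOD, G, G)
food_ww = between_sp(PIGS, FOOD, W, W)
```

Passing a different column index as `key` gives the corresponding lines for food or sex without any change to the routine.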
Similarly, the sums of squares and products for food and sex can be
obtained.

If $w_{ij}$ and $g_{ij}$ denote the total of 5 observations for the ith type of
food and jth sex, then the sums of squares and products for the joint
effects of food and sex, with 5 D.F., are

$$\tfrac{1}{5}\sum\sum g_{ij}^2 - \bar{g}\sum g = 3.2422$$
$$\tfrac{1}{5}\sum\sum g_{ij}w_{ij} - \bar{g}\sum w = -0.885$$
$$\tfrac{1}{5}\sum\sum w_{ij}^2 - \bar{w}\sum w = 59.90$$


If from these the corresponding expressions due to food (2 D.F.) and
sex (1 D.F.) are subtracted, the expressions for food × sex interaction
are obtained. Table 3g.3β gives the whole analysis. The error line
is obtained by subtraction from the total.


TABLE 3g.3β. Analysis of Variance and Covariance

               D.F.    $g^2$                  $wg$                   $w^2$

Pen            4       4.9607                 40.324                 605.87
Food           2       2.3242                 -0.171                 5.40
Sex            1       0.4538                 -4.813                 32.03
Food × Sex     2       0.4642                 4.099                  22.47
Error          20      8.4039 = $E_{00}$      39.540 = $E_{01}$      442.93 = $E_{11}$

Total          29      16.6068                78.979                 1108.70


There is only one regression constant to be estimated.

$$E_{11}b = E_{01}$$
$$442.93\,b = 39.540$$
$$b = 0.089269$$

TABLE 3g.3γ. Test for Regression from Error Line Above

               D.F.    S.S.                                  M.S.      F
Regression     1       $bE_{01}$ = 3.5297                    3.5297    13.76
Residual       19      $E_{00} - bE_{01}$ = 4.8742           0.2565

Total          20      $E_{00}$ = 8.4039


The ratio is significant at the 1% level so that the comparisons can be 
made more efficient by eliminating the concomitant variations. 

If the hypothesis specifies that there are no differences in food, then
the residual sums of squares and products are obtained by adding the
rows corresponding to food and error:

$$E_{00}' = 10.7281 \qquad E_{01}' = 39.369 \qquad E_{11}' = 448.33$$

The new regression coefficient is

$$b''E_{11}' = E_{01}'$$
$$b'' = 0.087813$$

The residual sum of squares when the hypothesis is true is

$$E_{00}' - b''E_{01}' = 7.2710 \text{ with 21 D.F.}$$


TABLE 3g.3δ. Test for Differences in Food, Eliminating
the Effects of Initial Weight

                 D.F.    S.S.                                   M.S.      F
Food             2       * 2.3968                               1.1984    4.67
Residual         19      $E_{00} - bE_{01}$ = 4.8742            0.2565

Food + Error     21      $E_{00}' - b''E_{01}'$ = 7.2710

* Obtained by subtraction.


The ratio is significant at the 5% level. To test for food without adjustment
for concomitant variation, we have to construct the ratio with
2 and 20 degrees of freedom

$$F = \frac{2.3242}{2} \times \frac{20}{8.4039} = 2.77$$

which is not significant at the 5% level. The quantities used above
are taken from the analysis of variance table (Table 3g.3β). The
differences caused by food could be detected when the concomitant
variation is eliminated. Similarly, any other effect such as sex or
interaction can be tested.
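The whole adjustment for initial weight needs only the Food and Error rows of the analysis of variance and covariance table. The sketch below repeats the arithmetic of this section; the function `adjusted_residual` and the tuple layout are conventions introduced here for illustration.

```python
def adjusted_residual(e00, e01, e11):
    """Regression coefficient and residual S.S. after fitting one concomitant variable."""
    b = e01 / e11
    return b, e00 - b * e01

# Rows of the analysis of variance and covariance table: (D.F., g^2, wg, w^2).
food = (2, 2.3242, -0.171, 5.40)
error = (20, 8.4039, 39.540, 442.93)

# Regression on initial weight estimated from the error line.
b, resid = adjusted_residual(*error[1:])
resid_df = error[0] - 1
f_regression = (error[1] - resid) / (resid / resid_df)

# Under the hypothesis of no food differences, pool the food and error rows.
pooled = tuple(f + e for f, e in zip(food[1:], error[1:]))
b0, resid0 = adjusted_residual(*pooled)

# Food tested after adjustment (2 and 19 D.F.) and without it (2 and 20 D.F.).
f_food_adjusted = ((resid0 - resid) / food[0]) / (resid / resid_df)
f_food_unadjusted = (food[1] / food[0]) / (error[1] / error[0])
```

The adjusted ratio is larger than the unadjusted one because the residual mean square falls from about 0.42 to about 0.26 once the regression on initial weight is removed.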


3g.4 A Problem of Inheritance in Man 

The methods of analysis of variance and regression are of great value 
in studying the problems of inheritance. Some aspects of Boas’ data * 
analyzed by Fisher and Gray (1937), are given in this section for illus- 
trating the methods. 

The data consist of measurements on Sicilian children and some of 
their parents. The first step in problems of this nature is to obtain 
the measurements on children corrected for age. The measurement (m) 
on any character can be represented by 


$$m = f_i + bA + E$$

where $f_i$ is a constant for the ith fraternity, A the age, b
the regression coefficient, and E the deviation. Under
the above formulation the covariance between A and m can be analyzed
into between and within fraternities. If the latter is denoted by $E_{01}$
and the sum of squares for A within fraternities by $E_{11}$, then the best
estimate of b is $E_{01}/E_{11}$. The measurement corrected for age will be
$y = m - AE_{01}/E_{11}$.

* The data are published in Materials for the study of inheritance in man (New York,
Columbia University Press, 1928).

In the above data the measurements of all the children were used in 
calculating the regression coefficient. For studying the problems of 
inheritance only 752 children whose parents had been also measured 
were considered. The sum of the corrected statures for these children 
is 1505, and their sum of squares 3,304,643 mm?. The total corrected 
sum of squares for 751 degrees of freedom is 


$$3{,}304{,}643 - \frac{(1505)^2}{752} = 3{,}304{,}643 - 3012 = 3{,}301{,}631 \text{ mm}^2$$


The children belonged to 337 different fraternities and 235 different
combinations of parental heights. The sum of squares between fraternities
with 336 degrees of freedom can be split up into between
fraternities with the same parental statures (102 D.F.) and between
combinations of parental heights (234 D.F.). The sum of squares
between fraternities is obtained by the usual formula for between
groups. To derive the sum of squares between 235 different combinations
of parental heights, all the children belonging to any combination are
considered as forming a group and the between group sum of squares is
calculated. The sum of squares between fraternities of the same parental
heights is obtained by subtracting the latter sum of squares from the
former.

The analysis of variance is given in Table 3g.4α.
TABLE 3g.4α. Analysis of Variance, Stature

                                          D.F.    S.S.          M.S.
Within fraternities                       415     1,272,150     3065
Between fraternities with the
  same parental height                    102     406,257       3983
Between combinations of parental
  height                                  234     1,623,224

Total                                     751     3,301,631

The mean square for between fraternities with the same parental
height is greater, though not significantly, than that for within
fraternities, indicating that parents of the same height were not identical
genetically. Whether the former mean square is significantly different 
from the latter or not, it supplies the valid estimate of error for testing 
any hypothesis concerning the regression of child’s stature on those of 
parents. 

In order to test not only for linear regression on the two parents 
independently but also for theoretically possible deviations due to bias 
in dominance, the formula chosen was of the form 


$$Y = a + b_1x_1 + b_2x_2 + b_3x_3$$

where $x_1$ and $x_2$ stand for the heights of the father and mother and $x_3$
for the product $x_1x_2$.

The method of solving for $a$, $b_1$, $b_2$, $b_3$ is the same as that considered
in detail in 3f.2. The matrix $(C_{ij})$ giving the variances and covariances
of the estimates $b_1$, $b_2$, $b_3$, written in millionths, is


0.384227 —0.041917 —0.0074896 
—0.041917 0.438083 —0.0052768 
—0.0074896 —0.0052768 0.0117165 


The values of the coefficients are 
$$b_1 = 0.02618850 \qquad b_2 = 0.8420271 \qquad b_3 = 0.0008596738$$


The sum of squares due to regression (3 D.F.) is 526,452. To test the 
adequacy of the regression formula chosen, the mean square for devia- 


tion from regression is to be compared with the mean square for between 
fraternities of like parents. 


TABLE 3g.4β. Test for the Adequacy of Regression

                                  D.F.    S.S.          M.S.    F
Regression                        3       526,452
Deviation from regression         231     1,096,772     4748    1.2057

Between combinations of
  parental heights                234     1,623,224
Between fraternities of
  like parents                    102     406,257       3983

The ratio for deviation from regression is not significant so that there
is no evidence against the adequacy of the regression formula. The
mean square for deviation from regression supplies the valid estimate of
error for testing any hypothesis concerning the regression coefficients.

The sum of squares for $b_3$ alone is

$$\frac{b_3^2}{C_{33}} = 6308$$

The sum of squares for regression (2 D.F.) when $b_3$ is not considered
is obtained by subtracting this quantity from the total sum of squares
due to regression (3 D.F.). The test for the regression coefficients is
given in Table 3g.4γ.


TABLE 3g.4γ. Tests for the Regression Coefficients

                              D.F.    S.S.          M.S.       F
Due to $b_3$                  1       6,308         6,308      1.3286
Linear regression             2       520,144       260,072    54.775

Regression                    3       526,452
Deviation from regression     231     1,096,772     4,748


The ratio for linear regression is significant. Since $b_3$ is positive,
there is an indication of negative bias in dominance, i.e., “a situation in
which the heterozygotes more nearly, or more frequently, resemble the
smaller rather than the larger of the corresponding homozygotes.” The
ratio for $b_3$, though greater than unity, is not large enough to establish
the existence of negative bias in dominance.
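The two variance ratios for the regression coefficients follow from the sums of squares and degrees of freedom alone. The sketch below recomputes them from the S.S. and D.F. columns; the tuple layout is a convention introduced here.

```python
# (D.F., S.S.) as read from the tables of this section.
due_to_b3 = (1, 6308)
regression = (3, 526452)
deviation = (231, 1096772)

# Linear regression on the two parental heights: regression less the b3 term.
linear = (regression[0] - due_to_b3[0], regression[1] - due_to_b3[1])

# Mean square for deviation from regression, used as the error for both tests.
error_ms = deviation[1] / deviation[0]

f_b3 = (due_to_b3[1] / due_to_b3[0]) / error_ms
f_linear = (linear[1] / linear[0]) / error_ms
```

Here `f_b3` is referred to 1 and 231 degrees of freedom and `f_linear` to 2 and 231.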
The actual regression formula is derived as

$$Y = 1.684225 + 0.2690299(x_1 - \bar{x}_1) + 0.3483146(x_2 - \bar{x}_2) + 0.0008596738(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)\ *$$

The sex difference in selection is

$$b_2 - b_1 - (\bar{x}_2 - \bar{x}_1)b_3 = 0.0792847$$

with standard error 0.06559 so that the difference in regression in favor
of the mother, though large, is not significant.


* The coefficients $b_1$ or $b_2$ are changed because of using the deviations from the
mean in the product term.


References

Aitken, A. C. (1935). On least squares and linear combination of observations.
Proc. Roy. Soc. Edin., 55, 42.

Fisher, R. A. Statistical methods for research workers. Oliver & Boyd.
Tenth edition.

Fisher, R. A. (1924). On a distribution yielding the error functions of several well-
known statistics. Proc. Int. Math. Congress, Toronto, p. 805.

Fisher, R. A., and H. Gray (1937). Inheritance in man: Boas’ data studied by the
method of analysis of variance. Ann. Eugen. London, 8, 74.

Hooke, B. G. E. (1926). A third study of the English skull with special reference
to the Farringdon Street crania. Biom., 18, 1.

Kołodziejczyk, St. (1935). On an important class of statistical hypotheses. Biom.,
27, 161.

Macdonell, W. R. (1904). A study of the variation and the correlation of the
human skull with special reference to English crania. Biom., 3, 19.

Markoff, A. A. (1904). Calculus of probability. Russian Edition.

Münter, A. H. (1936). A study of the long bones of the arms and legs in man
with special reference to Anglo-Saxon skeletons. Biom., 28, 258.

Pearson, K. (1898). Mathematical contributions to the theory of evolution. V. On
reconstruction of stature of prehistoric men. Philos. Trans. Roy. Soc., A, 192, 169.

Rao, C. R. (1945). Generalisation of Markoff’s theorem and tests of linear hypotheses.
Sankhyā, 7, 9.

Rao, C. R. (1945). Markoff’s theorem with linear restrictions on parameters.
Sankhyā, 7, 16.

Rao, C. R. (1946). On the linear combination of observations and the general
theory of least squares. Sankhyā, 7, 237.

“Student” (W. S. Gosset) (1908). On the probable error of a mean. Biom., 6, 1.

Wishart, J. (1938). Growth rate determination in nutrition studies with the bacon
pig and their analysis. Biom., 30, 16.


CHAPTER 4 


The General Theory of Estimation 
and the Method of Maximum 
Likelihood 


4a Best Unbiased Estimates 


4a.1 Estimation by Minimizing the Variance 

In Chapter 3 problems were considered in which the class of estimates 
of parametric functions was restricted to linear functions of observa- 
tions only. Nothing, however, was assumed about the actual distribu- 
tion functions of these variables, except that their expectations are 
linear functions of unknown parameters and have a common unknown 
variance. It is of interest to examine the methods of obtaining the 
estimates with the minimum possible variance by considering the total- 
ity of unbiased estimates. This method of estimation is not necessarily 
the best. The general problem is that of deriving a function $t = f(y_1,
\ldots, y_n)$ of the observations $y_1, \ldots, y_n$ such that with respect to any
other function $t'$ the probabilities satisfy the relationship

$$P(\theta - \lambda_1 < t < \theta + \lambda_2) \ge P(\theta - \lambda_1 < t' < \theta + \lambda_2) \qquad \text{(4a.1.1)}$$

for all possible $\lambda_1$ and $\lambda_2$ in an interval $(0, \lambda)$. The choice of the interval
may be fixed by other considerations, depending on the frequency and
the magnitude of departure of the estimate from the true value allowable
in a problem. If the condition (4a.1.1) is satisfied for all $\lambda$, then a
necessary condition is

$$E(t' - \theta)^2 \ge E(t - \theta)^2$$

where E stands for expectation. If, further, it is assumed that the
estimate should be unbiased, then it follows that

$$V(t') \ge V(t)$$

where V stands for variance.


130 ESTIMATION AND MAXIMUM LIKELIHOOD 


As no simple solution satisfying the postulate (4a.1.1) exists, the in- 
evitable arbitrariness of the postulates of unbiasedness and minimum 
variance needs no emphasis. The only justification for choosing the 
estimate with the minimum possible variance from the class of un- 
biased estimates is that a necessary condition for (4a.1.1) to hold for 
all A, with the further requirement E(t) = 6, is ensured. The condition 
of unbiasedness is particularly defective in that many biased estimates 
with smaller variances lose their claim as estimating functions. There 
are, however, numerous examples where a slightly biased estimate is 
preferred to an unbiased estimate with a greater variance. Until a 
unified solution of the problem of estimation is set forth, an estimating 
function has to be subjected to a critical examination as to its bias, 
variance, and frequency for a given amount of departure from the true 
value before utilizing it. The theory of confidence intervals as developed 
by J. Neyman is a great advance toward such a unified theory. 

There is one important aspect which favors unbiased estimates. In 
biometric investigations it is often necessary to combine the evidence
supplied by various sources. Often the evidence is in the nature of an
estimate, probably with a standard error attached to it. For instance,
two geneticists may be determining the proportion of albino mice produced
under certain types of mating. One gives a proportion $p_1 \pm e_1$,
and the other $p_2 \pm e_2$. If the estimates are unbiased, then a combined
unbiased estimate may be reached, and with the accumulation of more 
and more evidence the true value can be approached through a series 
of unbiased estimates. On the other hand, if biased estimates are pub- 
lished without any indication as to the nature of the bias involved, nothing 
definite can be said about the combined estimate. The bias may exceed 


the standard error at some stage, and ultimately the combined estimate 
may not be near the true value. 
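One way of forming a combined unbiased estimate from several independent unbiased estimates is to weight each inversely to the square of its standard error. This particular rule is not stated in the text and is used here only as an assumed illustration; the figures are hypothetical and the function name `combine` is introduced for the sketch.

```python
from math import sqrt

def combine(estimates):
    """Inverse-variance weighted mean of independent unbiased estimates.

    `estimates` is a list of (value, standard_error) pairs; the result is the
    combined value and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / sum(weights)
    return value, sqrt(1.0 / sum(weights))

# Hypothetical proportions of albino mice reported by two workers.
p, se = combine([(0.25, 0.05), (0.30, 0.05)])
```

Because the weights sum to one after normalization, the combined value stays unbiased whenever each input is unbiased, and its standard error shrinks as further estimates are added. No such statement can be made if the inputs carry unknown biases.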


4a.2 The Information Limit to Variance: A Single Parameter 


Let $\phi(x_1, \ldots, x_n, \theta)$ be the probability density of the observations $x_1$,
$\ldots$, $x_n$, and $t(x_1, \ldots, x_n)$ be an unbiased estimate of $\psi(\theta)$, a function of
$\theta$, the parameter occurring in the probability density. Then

$$\int t\phi\,dv = \psi(\theta) \qquad \text{(4a.2.1)}$$

where $dv$ stands for the product of differential elements $dx_1, \ldots, dx_n$
and a single integral sign stands for the multiple integral. It may be
noted that when the variables are discrete the integral sign can be
replaced by the summation symbol. If the limits of integration do not
involve $\theta$, then differentiation under the integral sign yields

$$\int t\frac{d\phi}{d\theta}\,dv = \int t\left(\frac{1}{\phi}\frac{d\phi}{d\theta}\right)\phi\,dv = \frac{d\psi}{d\theta}$$

if the above integral exists, which shows that the covariance between $t$
and $(1/\phi)(d\phi/d\theta)$ is $(d\psi/d\theta)$, the latter variable having zero expectation
because $\int \phi\,dv = 1$. Since the square of the covariance is not
greater than the product of the variances of the two variables, the
following relationships (using V and C for variance and covariance) are
true.

$$V(t)\,V\!\left(\frac{1}{\phi}\frac{d\phi}{d\theta}\right) \ge \left\{C\!\left(t, \frac{1}{\phi}\frac{d\phi}{d\theta}\right)\right\}^2$$

$$V(t) \ge \frac{(d\psi/d\theta)^2}{I} \qquad \text{(4a.2.2)}$$

where

$$I = V\!\left(\frac{1}{\phi}\frac{d\phi}{d\theta}\right) = E\left(\frac{d\log\phi}{d\theta}\right)^2$$

The quantity $I$ is the information on $\theta$ supplied by a sample of n observations
as defined by R. A. Fisher. This gives the following theorem.


Theorem. The variance of any unbiased estimate of $\psi(\theta)$, a function
of the unknown parameter $\theta$, is not less than $[\psi'(\theta)]^2/I$, which is defined
independently of any method of estimation. The conditions to be satisfied
are that the range of the stochastic variable is independent of $\theta$
and the probability density admits differentiation under the integral
sign.

If $\psi(\theta) = \theta$, then this limit is $1/I$. We shall call $[\psi'(\theta)]^2/I$ the information
limit to variance for the estimation of $\psi(\theta)$. This is not, however,
the minimum attainable in any particular distribution. An unbiased
estimate of $\psi(\theta)$ with the minimum attainable variance will be
called a best unbiased estimate. It is incorrect to conclude that an estimate
is inefficient if its variance is not equal to $[\psi'(\theta)]^2/I$ unless it is
ascertained that this minimum is attainable. In fact, it is shown later
that there exists a more exact expression for the minimum possible
variance.

As a corollary to the above theorem, it follows that the minimum
value of

$$E(T - \theta)^2$$

for the set of statistics T with expectation $\psi(\theta)$ is

$$[\psi(\theta) - \theta]^2 + \frac{[\psi'(\theta)]^2}{I}$$

Thus if $b(\theta)$ is the bias in the estimate T of $\theta$, then

$$\psi(\theta) = \theta + b(\theta)$$

in which case

$$E(T - \theta)^2 \ge \{b(\theta)\}^2 + \frac{[1 + b'(\theta)]^2}{I}$$


4a.3 Distributions Admitting Estimates with the Information 
Limit to Variance 


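The relationship (4a.2.2) may be written as

$$\int (t - \psi(\theta))^2\phi\,dv \times \int \left(\frac{1}{\phi}\frac{d\phi}{d\theta}\right)^2\phi\,dv \ge \left\{\int (t - \psi(\theta))\,\frac{1}{\phi}\frac{d\phi}{d\theta}\,\phi\,dv\right\}^2$$

It is known from Schwarz's inequality that the equality is attained only
when

$$t - \psi(\theta) = \lambda\,\frac{1}{\phi}\frac{d\phi}{d\theta}$$

where $\lambda$ is a constant depending only on $\theta$. The solution of
this differential equation is

$$\log\phi = \int \frac{t - \psi(\theta)}{\lambda}\,d\theta + A = A + t\Theta_1 + \Theta_2$$

where $\Theta_1$ and $\Theta_2$ are functions of $\theta$, and $A$ is
independent of $\theta$. Hence

$$\phi = \phi_1 \exp(t\Theta_1 + \Theta_2) \tag{4a.3.1}$$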


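where $\phi_1$ is only a function of the observations. Also if

$$\phi = \phi_1 \exp(t\Theta_1 + \Theta_2)$$

then

$$\int \phi\,dv = \int \phi_1 \exp(t\Theta_1 + \Theta_2)\,dv = 1$$

or

$$\int \phi_1 \exp(t\Theta_1)\,dv = \exp(-\Theta_2)$$

On differentiating with respect to $\Theta_1$ twice,

$$\int \phi_1 t \exp(t\Theta_1)\,dv = -\frac{d\Theta_2}{d\Theta_1}\exp(-\Theta_2)$$

$$\int \phi_1 t^2 \exp(t\Theta_1)\,dv = \left[-\frac{d^2\Theta_2}{d\Theta_1^2} + \left(\frac{d\Theta_2}{d\Theta_1}\right)^2\right]\exp(-\Theta_2)$$

Hence

$$E(t) = -\frac{d\Theta_2}{d\Theta_1}, \qquad V(t) = -\frac{d^2\Theta_2}{d\Theta_1^2} \tag{4a.3.2}$$

The information limit for the estimation of $-d\Theta_2/d\Theta_1$ is

$$\left[\frac{d}{d\theta}\left(\frac{d\Theta_2}{d\Theta_1}\right)\right]^2 \div \left[-\frac{d^2\Theta_2}{d\Theta_1^2}\left(\frac{d\Theta_1}{d\theta}\right)^2\right] = -\frac{d^2\Theta_2}{d\Theta_1^2}$$

which is the same as that derived in (4a.3.2) so that the information
limit is attainable.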

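Hence the necessary and sufficient condition that a distribution admits the
estimation of a suitably chosen function of the parameter with variance
equal to the information limit is that

$$\phi = \phi_1 \exp(t\Theta_1 + \Theta_2)$$

where $\phi_1$ and $t$ are functions of the observations only and $\Theta_1$
and $\Theta_2$ are functions of $\theta$ only. The parametric function to be
estimated is

$$-\frac{d\Theta_2}{d\Theta_1} = -\frac{d\Theta_2}{d\theta}\,\frac{d\theta}{d\Theta_1}$$

and the variance of the estimate is

$$-\frac{d^2\Theta_2}{d\Theta_1^2} = -\frac{d}{d\theta}\left(\frac{d\Theta_2}{d\Theta_1}\right)\frac{d\theta}{d\Theta_1}$$

For any estimate $t$ which has the minimum variance (4a.2.2),

$$t - \psi(\theta) = \lambda\,\frac{1}{\phi}\frac{d\phi}{d\theta}$$

$$V[t - \psi(\theta)] = \lambda^2 I = \frac{[\psi'(\theta)]^2}{I}$$

so that $\lambda^2 = [\psi'(\theta)]^2/I^2$, which is a unique function of
$\theta$. This shows that $t$ is unique as the best unbiased estimate of
$\psi(\theta)$.

Example 1. Let $x_1, \dots, x_n$ be $n$ independent observations from the
normal population

$$\frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/2\sigma^2}\,dx$$

If $n\bar{x} = x_1 + \dots + x_n$, then

$$\phi = \frac{1}{(\sigma\sqrt{2\pi})^n}\,e^{-\{n(\bar{x}-\mu)^2 + \sum (x_i-\bar{x})^2\}/2\sigma^2} = \phi_1\,e^{(2n\mu\bar{x} - n\mu^2)/2\sigma^2}$$

where

$$\phi_1 = \frac{1}{(\sigma\sqrt{2\pi})^n}\,e^{-\sum x_i^2/2\sigma^2}$$

which is independent of $\mu$. Thus $\phi$ can be expressed in the form
(4a.3.1), which means that $\bar{x}$ is the estimate of

$$-\frac{d(-n\mu^2/2\sigma^2)}{d(n\mu/\sigma^2)} = \mu$$

and has the minimum variance (4a.2.2).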
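Example 2. Consider $n$ independent observations from the population

$$\frac{\alpha^p}{\Gamma(p)}\,e^{-\alpha x}x^{p-1}\,dx$$

$$\phi = \frac{\alpha^{np}}{\{\Gamma(p)\}^n}\,e^{-\alpha\sum x_i}\Big(\prod x_i\Big)^{p-1} = \phi_1\,e^{-p\alpha n y + np\log\alpha}$$

where $\phi_1$ is independent of $\alpha$ and $y = (\sum x_i)/pn$, so that
the minimum variance (4a.2.2) is attained for the estimate $y$ of

$$-\frac{d(np\log\alpha)}{d\alpha} \Big/ \frac{d(-p\alpha n)}{d\alpha} = \frac{1}{\alpha}$$

$$V(y) = \frac{d}{d\alpha}\left(\frac{1}{\alpha}\right) \Big/ \frac{d(-p\alpha n)}{d\alpha} = \frac{1}{np\alpha^2}$$

Also

$$\phi = \phi_1\,e^{n(p-1)z - n\log\Gamma(p)}$$

where $\phi_1$ is independent of $p$ and $z = (1/n)\log(\prod x_i)$, so that
the minimum variance (4a.2.2) is attained for the estimate $z$ of

$$\frac{d\log\Gamma(p)}{dp} \Big/ \frac{d(p-1)}{dp} = \frac{d\log\Gamma(p)}{dp}$$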

Example 3. The chance of obtaining $r$ successes in $n$ independent trials
with probability $\pi$ for success in a trial is

$$\phi = \binom{n}{r}\pi^r(1-\pi)^{n-r}$$

$$\frac{d\log\phi}{d\pi} = \frac{r - n\pi}{\pi(1-\pi)}$$

$$I = \frac{V(r - n\pi)}{\pi^2(1-\pi)^2} = \frac{n}{\pi(1-\pi)}$$

$$E\left(\frac{r}{n}\right) = \pi \quad\text{and}\quad V\left(\frac{r}{n}\right) = \frac{\pi(1-\pi)}{n} = \frac{1}{I}$$

This shows that $r/n$ as the estimate of $\pi$ has the minimum variance
(4a.2.2).

Example 4. The chance of $n$ independent observations from a Poisson
distribution is

$$\phi = e^{-n\mu}\,\frac{\mu^{x_1 + \dots + x_n}}{x_1! \cdots x_n!}$$

$$I = V\left(\frac{1}{\phi}\frac{d\phi}{d\mu}\right) = V\left(-n + \frac{x_1 + \dots + x_n}{\mu}\right) = \frac{n\mu}{\mu^2} = \frac{n}{\mu}$$

The estimate $\bar{x}$ of $\mu$ has the variance $\mu/n = 1/I$ so that the
minimum variance (4a.2.2) is attained.
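
The equalities of Example 3 can be checked by direct summation over the
binomial distribution. The following Python sketch is an illustration only:
the function names and the values n = 12, π = 3/10 are assumptions chosen
for the check, and exact rational arithmetic is used so that the comparisons
involve no rounding.

```python
from fractions import Fraction
from math import comb


def binomial_probability(n, r, pi):
    """Chance of r successes in n trials with success probability pi."""
    return comb(n, r) * pi**r * (1 - pi) ** (n - r)


def binomial_information(n, pi):
    """I = E[(d log phi / d pi)^2], summed over r = 0, ..., n."""
    total = Fraction(0)
    for r in range(n + 1):
        score = (r - n * pi) / (pi * (1 - pi))  # d log phi / d pi
        total += binomial_probability(n, r, pi) * score**2
    return total


def variance_of_proportion(n, pi):
    """Exact variance of r/n obtained from its first two moments."""
    mean = sum(binomial_probability(n, r, pi) * Fraction(r, n) for r in range(n + 1))
    second = sum(binomial_probability(n, r, pi) * Fraction(r, n) ** 2 for r in range(n + 1))
    return second - mean**2


n, pi = 12, Fraction(3, 10)
info = binomial_information(n, pi)
print(info == n / (pi * (1 - pi)))                 # information equals n / (pi (1 - pi))
print(variance_of_proportion(n, pi) == 1 / info)   # V(r/n) equals the information limit
```

Both comparisons are exact because every quantity is a rational number.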


4a.4 Sufficient Statistics and Unbiased Estimates

A statistic $T$ is said to be sufficient for the parameter $\theta$ if, with
respect to any other statistic $T'$, the joint probability density
$P(T, T')$ of $T$ and $T'$ is of the form

$$P(T, T') = P_1(T, \theta)\,P_2(T' \mid T)$$

where $P_1(T, \theta)$ is the probability density for $T$, and
$P_2(T' \mid T)$, the relative probability density of $T'$ given $T$, is
independent of $\theta$. From this definition it can be shown that the
necessary and sufficient condition for $\phi$ to admit a sufficient
statistic is

$$\phi = \Phi(T, \theta)\,\phi_1(x_1, \dots, x_n) \tag{4a.4.1}$$

where $\phi_1$, as a relative probability density given $T$, is independent
of $\theta$, and $T$ is a function of $x$ only. The statistic $T$ is said
to be sufficient for $\theta$.

In fact, any function of $T$ having one-to-one correspondence * with $T$
will be sufficient for $\theta$.


Theorem. If an unbiased estimate and a sufficient statistic exist for
$\psi(\theta)$, the best unbiased estimate of $\psi(\theta)$ is an explicit
function of the sufficient statistic.


The statement does not imply that there exists a function of the
sufficient statistic which attains the minimum variance (4a.2.2). But
the best unbiased estimates have to be sought from the functions of the
sufficient statistic only.

If $t$ is unbiased for $\psi(\theta)$, then

$$\psi(\theta) = \int t\phi\,dv = \int t\,\Phi(T, \theta)\,\phi_1\,dv$$

Integration over the surfaces of constant $T$ gives

$$\psi(\theta) = \int f(T)\,\Phi(T, \theta)\,dT \tag{4a.4.2}$$

which shows that there exists a function $f(T)$, of the sufficient statistic
$T$, which is unbiased for $\psi(\theta)$. Also

$$\int (t - \psi(\theta))^2\phi\,dv = \int (t - f(T))^2\phi\,dv + \int (f(T) - \psi(\theta))^2\,\Phi(T, \theta)\,dT$$

since the product term vanishes in virtue of (4a.4.2). Hence

$$\int (t - \psi(\theta))^2\phi\,dv \ge \int (f(T) - \psi(\theta))^2\,\Phi(T, \theta)\,dT$$

i.e.,

$$V(t) \ge V[f(T)]$$

which proves the theorem. Thus, when a sufficient statistic exists, we
need only search for the best estimates among functions of the sufficient
statistic. If there exists a unique function of $T$ unbiased for
$\psi(\theta)$, then this is necessarily the best. But, if more than one
function of $T$ is unbiased for $\psi(\theta)$, then the one with the least
variance has to be chosen. This leads to the corollary.


Corollary. If a function $F(T)$ of the sufficient statistic $T$ is unbiased
for $\psi(\theta)$ and is also unique, then this is the best unbiased
estimate.


* It can be verified that $\bar{x}$ is a sufficient statistic for $\mu$, the
mean of a normal distribution. But $\bar{x}^2$ is not sufficient for $\mu$,
for the condition (4a.4.1) is no longer true. The statement, generally made,
that any function of a sufficient statistic is also sufficient is not true.


Example 1. The minimum variance unbiased estimate is unique.
[Hint: If $T_1$ and $T_2$ are two such estimates, then
$V[(T_1 + T_2)/2] \ge V(T_1)$. This shows that the correlation between
$T_1$ and $T_2$ is unity.]

Example 2. If $T_1$ and $T_2$ are two unbiased estimates of a parameter
with variances $\sigma_1^2$, $\sigma_2^2$ and correlation $\rho$, what is
the best unbiased linear combination of $T_1$ and $T_2$ and what is the
variance of such a compound?

We have to minimize

$$l_1^2\sigma_1^2 + 2\rho l_1 l_2\sigma_1\sigma_2 + l_2^2\sigma_2^2$$

subject to the condition $l_1 + l_2 = 1$. This gives, in one step,

$$l_1(\sigma_1^2 - \rho\sigma_1\sigma_2) = l_2(\sigma_2^2 - \rho\sigma_1\sigma_2)$$

giving the ratio between $l_1$ and $l_2$.
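
The ratio in Example 2, together with $l_1 + l_2 = 1$, determines the
weights explicitly. The short Python sketch below is only an illustration
under that reading; the function name and the numerical inputs are
assumptions, and the case $\sigma_1 = \sigma_2$, $\rho = 1$ (zero
denominator) is not handled.

```python
def best_linear_compound(sigma1, sigma2, rho):
    """Weights l1, l2 (summing to 1) minimizing the variance of l1*T1 + l2*T2.

    Solves l1*(s1^2 - rho*s1*s2) = l2*(s2^2 - rho*s1*s2) with l1 + l2 = 1.
    """
    cross = rho * sigma1 * sigma2
    denom = sigma1**2 + sigma2**2 - 2 * cross
    l1 = (sigma2**2 - cross) / denom
    l2 = 1 - l1
    variance = l1**2 * sigma1**2 + 2 * l1 * l2 * cross + l2**2 * sigma2**2
    return l1, l2, variance


# Uncorrelated estimates: the weights are inversely proportional to the variances.
print(best_linear_compound(1.0, 2.0, 0.0))

# With sigma1 = 1, sigma2 = 2 and rho = 1/2 the second estimate receives zero
# weight, which is the situation described in Example 3 (e = 1/4, rho = sqrt(e)).
print(best_linear_compound(1.0, 2.0, 0.5))
```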

Example 3. Suppose that $T_1$ in the above example is an unbiased
minimum variance estimate and $T_2$ any other unbiased estimate with
variance $\sigma^2/e$ where $V(T_1) = \sigma^2$. Then the correlation
between $T_1$ and $T_2$ is $\sqrt{e}$.

The best linear compound of $T_1$, $T_2$ has the coefficients $l_1$ and
$l_2$ satisfying the condition

$$l_1(e - \rho\sqrt{e}) = l_2(1 - \rho\sqrt{e})$$

Express the condition that the variance of the best compound is not
less than $\sigma^2$ or simply, since $T_1$ cannot be improved upon,
$l_2 = 0$, which means $e = \rho\sqrt{e}$.

Example 4. If $T_1$, $T_2$ are two unbiased statistics having the same
variance, then their correlation is $\ge 2e - 1$, where $e$ is the ratio of
the variance of the best estimate to the common variance of $T_1$ and $T_2$.

[Hint: Consider the statistic $T = (T_1 + T_2)/2$ and express the
condition $V(T) \ge$ the least variance.]


4a.5 Distributions Admitting Sufficient Statistics 
The necessary and sufficient condition that a distribution admits a
sufficient statistic is that the probability density can be written in the
form

$$\phi = \Phi(T, \theta)\,\phi_1(x_1, \dots, x_n) \tag{4a.5.1}$$

where $\Phi(T, \theta)$ is the density of the statistic $T$ and
$\phi_1(x_1, \dots, x_n)$, the density of the sample given $T$, is
independent of $\theta$. It is obviously not enough to state the necessary
and sufficient condition as the factorizability of $\phi$ into
$\Phi(T, \theta)$ and $\phi_1(x_1, \dots, x_n)$, where $T$ is a function of
$x_1, \dots, x_n$, and $\phi_1$ is independent of $\theta$, unless the range
of $x$ is independent of $\theta$. For instance consider

$$\phi = \left[\frac{\theta}{e^{\theta^2} - 1}\right]^n e^{\theta(x_1 + \dots + x_n)} \tag{4a.5.2}$$

where the range of each $x$ is from 0 to $\theta$. This does not admit a
sufficient statistic.

Let us now assume that $x_1, \dots, x_n$ are independent observations from
the same distribution so that

$$\phi = p(x_1, \theta) \cdots p(x_n, \theta)$$

If this is factorizable into $\Phi(T, \theta)$ and
$\phi_1(x_1, \dots, x_n)$ then, assuming that the functions are partially
differentiable with respect to $\theta$, we obtain

$$\sum \frac{\partial \log p(x_i, \theta)}{\partial\theta} = \frac{\partial \log \Phi(T, \theta)}{\partial\theta} = G(T, \theta) \tag{4a.5.3}$$

Since this holds for all $\theta$, any value of $\theta$ can be substituted
in (4a.5.3) to obtain the relation

$$u = \sum u(x_i) = g(T)$$

connecting $T$ and the statistic $u = \sum u(x_i)$. If $g(T)$ and $u(x)$
are differentiable functions, it follows that

$$\frac{\partial u}{\partial x_i} = \frac{du(x_i)}{dx_i} = \frac{dg}{dT}\,\frac{\partial T}{\partial x_i}$$

Also from (4a.5.3)

$$\frac{\partial G(T, \theta)}{\partial T}\,\frac{\partial T}{\partial x_i} = \frac{\partial^2 \log p(x_i, \theta)}{\partial\theta\,\partial x_i}$$

Therefore, for all $i$

$$\frac{\partial^2 \log p(x_i, \theta)}{\partial\theta\,\partial x_i} \Big/ \frac{du(x_i)}{dx_i} = \frac{\partial G(T, \theta)}{\partial T} \Big/ \frac{dg(T)}{dT} = \lambda_1(\theta), \quad\text{a function of } \theta \text{ only}$$

Integrating with respect to $T$,

$$G(T, \theta) = \lambda_1(\theta)\,g(T) + \lambda_2(\theta)$$

and then with respect to $\theta$,

$$\log \phi(x_1, \dots, x_n) = \Theta_1 g(T) + \Theta_2 + A(x_1, \dots, x_n)$$

$$\phi(x_1, \dots, x_n) = \phi_2\,e^{\Theta_1 g(T) + \Theta_2} \tag{4a.5.4}$$

where $g(T)$, $\phi_2$ are functions of $x_1, \dots, x_n$ only, and
$\Theta_1$, $\Theta_2$ are functions of $\theta$ only. This is obviously not
a necessary and sufficient condition because the only condition used is the
factorizability of $\phi$. In fact the illustration in (4a.5.2) has the same
form as (4a.5.4). But when a sufficient statistic exists, the distribution
must necessarily be of the form (4a.5.4).


4a.6 An Optimum Property of Sufficient Statistics 
In 4a.5 it was shown that the distribution admitting a sufficient
statistic is of the form

$$\phi = \exp[t_1\Theta_1 + \Theta_2 + t_2]$$

Let $F(t_1)$ be any function of $t_1$ with the expectation $\psi(\theta)$.
Consider an alternative function unbiased for $\psi(\theta)$ but differing
from $F(t_1)$ by $f(t_1)$. Then

$$\int f(t_1) \exp[t_1\Theta_1 + \Theta_2 + t_2]\,dv = 0$$

Continued differentiation, when permissible, yields

$$\int t_1^k f(t_1) \exp[t_1\Theta_1 + \Theta_2 + t_2]\,dv = 0 \quad\text{for all } k$$

or

$$\int e^{i\tau t_1} f(t_1) \exp[t_1\Theta_1 + \Theta_2 + t_2]\,dv = 0$$

where $i = \sqrt{-1}$ and $\tau$ is any real number. From Fourier's
inversion theorem it is known that, if
$f(t_1)\exp(t_1\Theta_1 + \Theta_2 + t_2)$ is continuous, then it must be
zero almost everywhere. Since the second expression cannot be zero, it
follows that $f(t_1) = 0$, which shows that $F(t_1)$ is unique as an
estimate of its expected value, and therefore the best possible. Hence the
following theorem is obtained.


Theorem: Any function of the sufficient statistic is the best estimate
of its expected value under the regularity conditions assumed above.


This is a general demonstration of the properties discussed in examples 1
and 2 below. The result is true under less stringent conditions than those
assumed above.


Example 1. Consider $n$ independent observations from the normal
population

$$\frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] dx$$

$$\phi = \phi_1 \exp\left[-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right]$$

where $\phi_1$ is independent of $\mu$ and $n\bar{x} = x_1 + \dots + x_n$.
This shows that $\bar{x}$ is sufficient for $\mu$. In fact, we have seen
that $\bar{x}$ has the minimum variance (4a.2.2) as an estimate of $\mu$.
Suppose that instead of $\mu$ we are seeking the estimate of $\mu^2$.
Evidently $\bar{x}$ is sufficient for $\mu^2$ or for any function of $\mu$.
The parameter $\sigma^2$ being considered known, an unbiased estimate of
$\mu^2$ is

$$X = \bar{x}^2 - \frac{\sigma^2}{n}$$

$$V(X) = E(\bar{x}^4) - \left(\mu^2 + \frac{\sigma^2}{n}\right)^2$$

$$= \mu^4 + \frac{6\mu^2\sigma^2}{n} + \frac{3\sigma^4}{n^2} - \mu^4 - \frac{2\mu^2\sigma^2}{n} - \frac{\sigma^4}{n^2}$$

$$= \frac{4\mu^2\sigma^2}{n} + \frac{2\sigma^4}{n^2}$$

The minimum variance (4a.2.2) is

$$\frac{(d\mu^2/d\mu)^2}{I} = \frac{4\mu^2\sigma^2}{n}$$

since $I = n/\sigma^2$, which is smaller than $V(X)$. But $X$ is the best
unbiased estimate since, as shown below, it has the minimum attainable
variance. To prove this it will be shown that there exists no function
of $\bar{x}$, the sufficient statistic, such that its expectation is
$\mu^2$ and its variance less than $V(X)$. Let the alternative function
differ from $X$ by $f(\bar{x})$. Since both are unbiased, it follows that

$$\exp\left[-\frac{n\mu^2}{2\sigma^2}\right] \int f(\bar{x}) \exp\left[\frac{-n(\bar{x}^2 - 2\mu\bar{x})}{2\sigma^2}\right] d\bar{x} = 0$$

Omitting the term before the integral and differentiating twice with
respect to $\mu$, the following relation is obtained.

$$\int f(\bar{x})\,\bar{x}^2 \exp\left[\frac{-n(\bar{x}^2 - 2\mu\bar{x})}{2\sigma^2}\right] d\bar{x} = 0$$

This shows that $\operatorname{cov}\{f(\bar{x}), \bar{x}^2\} = 0$.

$$V\{X + f(\bar{x})\} = V(X) + V\{f(\bar{x})\} + 2\operatorname{cov}\left\{f(\bar{x}),\ \bar{x}^2 - \frac{\sigma^2}{n}\right\}$$

$$= V(X) + V\{f(\bar{x})\} \ge V(X)$$

which shows that $X$ is better than any other estimate, or, in other words,
it has minimum variance as an estimate of $\mu^2$.
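
The gap between $V(X)$ and the information limit in Example 1 can be
tabulated exactly. The Python sketch below is an illustration only; the
function names and the values $\mu = 2$, $\sigma^2 = 3$, $n = 5$ are
assumptions. It uses the normal moments $E(\bar{x}^2) = \mu^2 + \sigma^2/n$
and $E(\bar{x}^4) = \mu^4 + 6\mu^2\sigma^2/n + 3\sigma^4/n^2$ that appear in
the expansion above.

```python
from fractions import Fraction


def variance_of_X(mu, sigma2, n):
    """Exact variance of X = xbar^2 - sigma^2/n for a normal sample mean."""
    tau2 = Fraction(sigma2) / n                       # variance of xbar
    fourth = mu**4 + 6 * mu**2 * tau2 + 3 * tau2**2   # E(xbar^4)
    second = mu**2 + tau2                             # E(xbar^2)
    return fourth - second**2


def information_limit(mu, sigma2, n):
    """(d mu^2 / d mu)^2 / I with I = n / sigma^2."""
    return (2 * mu) ** 2 * Fraction(sigma2) / n


mu, sigma2, n = 2, 3, 5
v = variance_of_X(mu, sigma2, n)
limit = information_limit(mu, sigma2, n)
print(v, limit, v - limit)   # the excess equals 2 sigma^4 / n^2
```

The excess $2\sigma^4/n^2$ is of smaller order than the limit itself, so the
two agree to first order in $1/n$.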
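Example 2. Consider $n$ independent observations from the population

$$\frac{\alpha^p}{\Gamma(p)}\,e^{-\alpha x}x^{p-1}\,dx$$

$$\phi = \frac{\alpha^{np}}{\{\Gamma(p)\}^n}\,e^{-\alpha n\bar{x}}\Big(\prod x_i\Big)^{p-1}$$

which shows that $\bar{x}$ is sufficient for $\alpha$. Since
$E[(np - 1)/n\bar{x}] = \alpha$, the statistic $(np - 1)/n\bar{x}$ is
unbiased for $\alpha$.

$$V\left(\frac{np - 1}{n\bar{x}}\right) = \alpha^2\left\{\frac{np - 1}{np - 2} - 1\right\} = \frac{\alpha^2}{np - 2}$$

The minimum variance (4a.2.2) is

$$\frac{\alpha^2}{np}$$

which is smaller than $V[(np - 1)/n\bar{x}]$. Let an alternative estimate
differ from $(np - 1)/n\bar{x}$ by $f(\bar{x})$. Then

$$\int f(\bar{x})\,\bar{x}^{np-1} \exp[-\alpha n\bar{x}]\,d\bar{x} = 0 \quad\text{for all } \alpha > 0$$

which gives the result that

$$\int \bar{x}^r f(\bar{x})\,\bar{x}^{np-1} \exp[-\alpha n\bar{x}]\,d\bar{x} = 0 \quad\text{for all } r$$

or

$$\int e^{it\bar{x}} f(\bar{x})\,\bar{x}^{np-1} \exp[-\alpha n\bar{x}]\,d\bar{x} = 0, \quad\text{where } i = \sqrt{-1}$$

From the Fourier inversion theorem we obtain

$$f(\bar{x})\,\bar{x}^{np-1} \exp[-\alpha n\bar{x}] = 0$$

or $f(\bar{x}) = 0$ almost everywhere, so that the function
$(np - 1)/n\bar{x}$ is unique as the estimate of $\alpha$.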

Example 3. Consider $n$ independent observations $x_1, \dots, x_n$ from
a rectangular distribution in the range 0 to $\beta$. The biggest
observation $x_b$ has the distribution

$$n\beta^{-n}x_b^{n-1}\,dx_b$$

so that

$$E(x_b) = \frac{n}{n+1}\,\beta$$

$$E\left(\frac{n+1}{n}\,x_b\right) = \beta, \qquad V\left(\frac{n+1}{n}\,x_b\right) = \frac{\beta^2}{n(n+2)}$$

Also, $x_b$ is sufficient for $\beta$; hence the unbiased minimum variance
estimate must be a function of $x_b$ only. The statistic $(1 + 1/n)x_b$ is
unbiased for $\beta$. It is also unique for, if another statistic differs
from this by $\phi(x_b)$, then

$$\int_0^\beta \phi(x_b)\,n\beta^{-n}x_b^{n-1}\,dx_b = 0$$

or

$$\int_0^\beta \phi(x_b)\,x_b^{n-1}\,dx_b = 0$$

for all $\beta$. This means that

$$\phi(x)\,x^{n-1} = 0$$

or $\phi(x) = 0$ almost everywhere. The statistic $(1 + 1/n)x_b$ is the best
unbiased estimate of $\beta$.
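
The mean and variance quoted in Example 3 follow from the moments of the
largest observation, $E(x_b^k) = n\beta^k/(n + k)$, obtained by integrating
$x^k$ against the density $n\beta^{-n}x^{n-1}$. The Python sketch below is an
illustration only; the function names and the values $n = 4$, $\beta = 3$
are assumptions.

```python
from fractions import Fraction


def moment_of_maximum(n, beta, k):
    """E(x_b^k) for the largest of n observations uniform on (0, beta)."""
    return Fraction(n, n + k) * beta**k


def mean_and_variance_of_scaled_maximum(n, beta):
    """Mean and variance of (1 + 1/n) x_b."""
    c = Fraction(n + 1, n)
    mean = c * moment_of_maximum(n, beta, 1)
    variance = c**2 * moment_of_maximum(n, beta, 2) - mean**2
    return mean, variance


n, beta = 4, 3
mean, variance = mean_and_variance_of_scaled_maximum(n, beta)
print(mean == beta)                               # unbiased for beta
print(variance == Fraction(beta**2, n * (n + 2)))  # equals beta^2 / (n (n + 2))
```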
Example 4. Consider a rectangular distribution in the range $\alpha$ to
$\beta$. The biggest and smallest observations $x_b$ and $x_s$ form a
sufficient set of statistics for $\alpha$ and $\beta$. The joint
distribution of $x_b$ and $x_s$ is

$$n(n-1)(\beta - \alpha)^{-n}(x_b - x_s)^{n-2}\,dx_b\,dx_s$$

The statistics

$$\frac{n x_s - x_b}{n - 1}$$

and

$$\frac{n x_b - x_s}{n - 1}$$

are unbiased for $\alpha$ and $\beta$, respectively. It can also be shown,
as in example 3, that these are the best.

Example 5. The unbiased minimum variance estimates of the central
point $(\alpha + \beta)/2$ and the range $(\beta - \alpha)$ are

$$\frac{x_b + x_s}{2} \quad\text{and}\quad \frac{n+1}{n-1}\,(x_b - x_s)$$

Example 6. Show that in example 3 any function $t(x_b)$ of $x_b$ unbiased
for $\beta$ has the terminal value

$$t(\beta) = \left(1 + \frac{1}{n}\right)\beta$$


4a.7 More Stringent Inequalities for the Variance of an Estimate 
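It was shown in 4a.2 that the minimum variance for an unbiased
estimate of $\psi(\theta)$ is not less than

$$\frac{[\psi'(\theta)]^2}{I}$$

Considering the equality

$$\int t\phi\,dv = \psi(\theta)$$

and differentiating $k$ times

$$\int t\,\frac{d^k\phi}{d\theta^k}\,dv = \frac{d^k\psi}{d\theta^k}$$

which leads to the relationship

$$V(t) \ge \frac{(d^k\psi/d\theta^k)^2}{J_{kk}}$$

where

$$J_{kk} = V\left(\frac{1}{\phi}\frac{d^k\phi}{d\theta^k}\right), \qquad k = 1, 2, \dots$$

Thus a chain of relationships can be derived of which the result obtained
in 4a.2 is a special case. More generally, if

$$J_{ir} = \operatorname{cov}\left(\frac{1}{\phi}\frac{d^i\phi}{d\theta^i},\ \frac{1}{\phi}\frac{d^r\phi}{d\theta^r}\right)$$

then the square of the multiple correlation of $t$ on
$\frac{1}{\phi}\frac{d\phi}{d\theta}, \dots, \frac{1}{\phi}\frac{d^k\phi}{d\theta^k}$ is

$$\left[\sum\sum J^{ir}\,\frac{d^i\psi(\theta)}{d\theta^i}\,\frac{d^r\psi(\theta)}{d\theta^r}\right] \div V(t)$$

where $(J^{ir})$ is the matrix inverse to $(J_{ir})$. Since this is not
greater than unity, we obtain the relationship

$$V(t) \ge \sum\sum J^{ir}\,\frac{d^i\psi(\theta)}{d\theta^i}\,\frac{d^r\psi(\theta)}{d\theta^r}$$

This result is due to Bhattacharya (1947). Since the multiple correlation
obtained by considering all the variables of a group is not less than
that for a subset, it follows that the lower bound to the variance of an
estimate can be improved by the addition of more variables of the type
$(1/\phi)(d^k\phi/d\theta^k)$.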
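
As a hedged illustration of how the second-order bound sharpens the limit
of 4a.2, consider again the estimation of $\mu^2$ from a normal sample with
known $\sigma^2$. Writing $a = n(\bar{x} - \mu)/\sigma^2$, one finds
$(1/\phi)\,d\phi/d\mu = a$ and $(1/\phi)\,d^2\phi/d\mu^2 = a^2 - n/\sigma^2$,
so that $J_{11} = n/\sigma^2$, $J_{12} = 0$ and $J_{22} = 2n^2/\sigma^4$;
these values are worked out here for the sketch and are not quoted from the
text. The function name and the numerical inputs are likewise assumptions.

```python
from fractions import Fraction


def second_order_bound(d1, d2, j11, j12, j22):
    """Sum over i, r of J^{ir} psi^(i) psi^(r) for k = 2, via the 2 x 2 inverse."""
    det = j11 * j22 - j12**2
    inv11, inv12, inv22 = j22 / det, -j12 / det, j11 / det
    return inv11 * d1 * d1 + 2 * inv12 * d1 * d2 + inv22 * d2 * d2


mu, sigma2, n = 2, Fraction(3), 5
j11 = n / sigma2               # variance of a
j12 = Fraction(0)              # cov(a, a^2) = E(a^3) = 0
j22 = 2 * (n / sigma2) ** 2    # variance of a^2
bound = second_order_bound(2 * mu, 2, j11, j12, j22)   # psi' = 2 mu, psi'' = 2

first_order = (2 * mu) ** 2 / j11
print(first_order, bound)      # 48/5 and 258/25
```

Under these assumptions the second-order bound equals
$4\mu^2\sigma^2/n + 2\sigma^4/n^2$, the variance of $\bar{x}^2 - \sigma^2/n$
found in example 1 of 4a.6.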




4a.8 The Case of Several Parameters 

Minimal Set of Sufficient Statistics. A set of statistics
$T_1, \dots, T_m$ is said to be a minimal set of sufficient statistics if
$m$ is the smallest number for which

$$\phi(x \mid \theta) = P_1(T_1, \dots, T_m \mid \theta)\,P_2(x \mid T) \tag{4a.8.1}$$

where $P_1$ is the probability density of $T_1, \dots, T_m$, and $P_2$, the
probability of the observations given the statistics $T_1, \dots, T_m$, is
independent of $\theta$. There is no restriction on $m$, which may be
greater than, equal to, or less than $k$, the number of parameters involved
in $\phi$.


Information Matrix: Let

$$i_{ij} = -\frac{\partial^2 \log \phi}{\partial\theta_i\,\partial\theta_j}$$

and $E(i_{ij}) = I_{ij}$.

The matrix $I = (I_{ij})$, $(i, j = 1, \dots, k)$, is called the information
matrix. If $(S_{ij})$ denotes the information matrix obtained from the
distribution $P_1(T \mid \theta)$, then it follows from the definition given
in (4a.8.1) that $(I_{ij}) = (S_{ij})$. Also, if $\phi(x, \theta)$ is the
probability density corresponding to $n$ independent sets of observations
from the same population, then $I_{ij} = nJ_{ij}$, where $J_{ij}$ is the
element of the information matrix corresponding to a single set of
observations. This is the additive property of information.

The following theorem may now be proved. 


Theorem. Let $t_1, \dots, t_r$ be $r \le k$ functionally independent
statistics such that

(i) $E(t_i) = \psi_i(\theta_1, \dots, \theta_k)$, $i = 1, \dots, r$;
(ii) $E(t_i - \psi_i)(t_j - \psi_j) = V_{ij}$, $i, j = 1, \dots, r$;

then:

(A) There exist functions $M_1, \dots, M_r$ of the minimal set of sufficient
statistics such that

(a) $E(M_i) = \psi_i(\theta_1, \dots, \theta_k)$;
(b) if $U = (U_{ij})$ where $U_{ij} = E(M_i - \psi_i)(M_j - \psi_j)$ and
$V = (V_{ij})$, then the matrix $(V - U)$ is positive definite or
semi-definite.

(B) If the ranges of integration do not involve the parameters, and
$I^{-1}$, the inverse of $I$, exists, and
$\Delta = (\partial\psi_i/\partial\theta_j)$, $(i = 1, \dots, r;\ j = 1, \dots, k)$,
then the matrix $V - \Delta I^{-1}\Delta'$, where $\Delta'$ is the transpose
of $\Delta$, is positive definite or semi-definite.


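Proof. Since

$$\int t_i\,\phi(x \mid \theta)\,dv = \int P_1(T \mid \theta)\,dT \int t_i\,P_2(x \mid T)\,dv' = \int M_i(T)\,P_1(T \mid \theta)\,dT \tag{4a.8.2}$$

it follows that $M_i$ is a function of $T$ only such that

$$E(M_i) = E(t_i) = \psi_i, \qquad i = 1, 2, \dots, r$$

This proves (A)(a) of the theorem.
Consider $\sum l_i t_i$ where $l_i$ are arbitrary constants.

$$E\Big(\sum l_i t_i\Big) = \sum l_i\psi_i = E\Big(\sum l_i M_i\Big)$$

$$\int \Big\{\sum l_i(t_i - \psi_i)\Big\}^2 \phi(x \mid \theta)\,dv = \int \Big\{\sum l_i(t_i - M_i) + \sum l_i(M_i - \psi_i)\Big\}^2 \phi(x \mid \theta)\,dv$$

$$= \int \Big\{\sum l_i(t_i - M_i)\Big\}^2 \phi(x \mid \theta)\,dv + \int \Big\{\sum l_i(M_i - \psi_i)\Big\}^2 \phi(x \mid \theta)\,dv$$

$$\quad + 2\int \sum l_i(M_i - \psi_i)\,P_1(T \mid \theta)\,dT \int \sum l_i(t_i - M_i)\,P_2(x \mid T)\,dv'$$

By virtue of the result (4a.8.2), the last term vanishes identically,
leaving only two positive quantities. If we retain only the latter, the
following relationship is obtained.

$$V\Big(\sum l_i t_i\Big) \ge V\Big(\sum l_i M_i\Big)$$

or

$$\sum\sum l_i l_j V_{ij} \ge \sum\sum l_i l_j U_{ij} \quad\text{or}\quad \sum\sum l_i l_j (V_{ij} - U_{ij}) \ge 0$$

This means that $(V - U)$ is positive definite or semi-definite. This proves
the result (A)(b) of the theorem.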


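Since

$$\int t_j\,\phi(x \mid \theta)\,dv = \psi_j$$

$$\int t_j\,\frac{\partial\phi}{\partial\theta_i}\,dv = \frac{\partial\psi_j}{\partial\theta_i}$$

so that by considering the dispersion matrix of $t_1, \dots, t_r$,
$\frac{1}{\phi}\frac{\partial\phi}{\partial\theta_1}, \dots, \frac{1}{\phi}\frac{\partial\phi}{\partial\theta_k}$
we find that the partitioned matrix

$$\begin{pmatrix} V & \Delta \\ \Delta' & I \end{pmatrix}$$

is positive definite or semi-definite. *
Consider the determinant

$$\begin{vmatrix} \delta_r & -\Delta I^{-1} \\ 0 & I^{-1} \end{vmatrix}$$

where $\delta_r$ is the unit square matrix of order $r$, which is always
positive. The product

$$\begin{vmatrix} \delta_r & -\Delta I^{-1} \\ 0 & I^{-1} \end{vmatrix} \cdot \begin{vmatrix} V & \Delta \\ \Delta' & I \end{vmatrix} = \begin{vmatrix} V - \Delta I^{-1}\Delta' & 0 \\ I^{-1}\Delta' & \delta_k \end{vmatrix} \ge 0$$

or $|V - \Delta I^{-1}\Delta'| \ge 0$. This result holds true even for a
subset of the statistics $t_1, \dots, t_r$, which means that the matrix
$V - \Delta I^{-1}\Delta'$ is positive definite or semi-definite. This
proves result (B) of the theorem.
A series of corollaries can be obtained from results (A) and (B) of the
above theorem.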


Corollary 1. By considering only the diagonal elements in
$V - \Delta I^{-1}\Delta'$

$$V_{ii} \ge \sum\sum I^{mn}\,\frac{\partial\psi_i}{\partial\theta_m}\,\frac{\partial\psi_i}{\partial\theta_n} \tag{4a.8.3}$$

where $I^{mn}$ are the elements of the matrix reciprocal to the information
matrix $(I_{mn})$. This shows that the variance of the estimate of $\psi_i$
is not less than a quantity which is defined independently of any method of
estimation. This is the generalization to many parameters of the expression
derived in (4a.2.2). If $\psi_i = \theta_i$ $(i = 1, \dots, k)$, the
relationship (4a.8.3) reduces to

$$V_{ii} \ge I^{ii}$$

These are not necessarily the minima attainable. Observe that $I^{ii}$ is
not less than $1/I_{ii}$ which is the limit obtained in (4a.2.2) for the
estimate of $\theta_i$. When the values of
$\theta_1, \dots, \theta_{i-1}, \theta_{i+1}, \dots, \theta_k$ are known,
then the limit (4a.2.2) is applicable. If not, the estimate of $\theta_i$
has to be independent of the above quantities and for this reason the limit
is increased.

* The matrix $I$ is the information matrix, not to be misunderstood for the
unit matrix introduced in Chapter 1. To distinguish this the unit matrix is
here represented by $\delta$ which is also an accepted symbol for a diagonal
matrix.

Corollary 2. Since the matrix $(V - U)$ is positive definite or
semi-definite, it follows that $V_{ii} \ge U_{ii}$, $(i = 1, \dots, r)$,
which shows that estimates with the minimum attainable variances are
explicit functions of sufficient statistics.

Corollary 3. Since the matrix $(V - \Delta I^{-1}\Delta')$ is positive
definite or semi-definite, it follows that (see example 1 in 1c.5)

$$|V| \ge |\Delta I^{-1}\Delta'|$$

The quantity $|V|$ is called the generalized variance of the estimates.
The above result shows that this is not less than a quantity which is
defined independently of any method of estimation.

Corollary 4. Since $(V - U)$ is positive definite or semi-definite, it
follows that $|V| \ge |U|$ (example 1 in 1c.5). This shows that the
estimates with the minimum possible generalized variance are functions
of the sufficient statistics.


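Corollary 5. If

$$V_{ii} = \sum\sum I^{mn}\,\frac{\partial\psi_i}{\partial\theta_m}\,\frac{\partial\psi_i}{\partial\theta_n}$$

in which case the estimate of $\psi_i$ has the minimum variance, then

$$V_{ij} = \sum\sum I^{mn}\,\frac{\partial\psi_i}{\partial\theta_m}\,\frac{\partial\psi_j}{\partial\theta_n}$$

so that the covariance of this best estimate with any estimate of any
other parametric function has a fixed value defined independently of
any method of estimation. This follows from the fact that the determinant

$$\begin{vmatrix} V_{ii} - \sum\sum I^{mn}\,\dfrac{\partial\psi_i}{\partial\theta_m}\dfrac{\partial\psi_i}{\partial\theta_n} & V_{ij} - \sum\sum I^{mn}\,\dfrac{\partial\psi_i}{\partial\theta_m}\dfrac{\partial\psi_j}{\partial\theta_n} \\[2ex] V_{ij} - \sum\sum I^{mn}\,\dfrac{\partial\psi_i}{\partial\theta_m}\dfrac{\partial\psi_j}{\partial\theta_n} & V_{jj} - \sum\sum I^{mn}\,\dfrac{\partial\psi_j}{\partial\theta_m}\dfrac{\partial\psi_j}{\partial\theta_n} \end{vmatrix}$$

which is a subdeterminant of $|V - \Delta I^{-1}\Delta'|$, is not less than
zero.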


Example. Consider $n$ independent observations from the normal
population

$$\frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] dx$$

It is easy to verify that the information matrix for $\mu$ and $\sigma^2$ is

$$(I_{\mu\sigma^2}) = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}$$

with its reciprocal

$$(I^{\mu\sigma^2}) = \begin{pmatrix} \dfrac{\sigma^2}{n} & 0 \\ 0 & \dfrac{2\sigma^4}{n} \end{pmatrix}$$

Since $\bar{x}$ as an estimate of $\mu$ has the minimum possible variance,
it follows that any estimate of $\sigma^2$ has zero correlation with
$\bar{x}$, since $I^{12} = 0$. This result can be extended to the case of
multivariate normal populations where it can be shown that the means are
uncorrelated with all possible estimates of the variances and covariances.

As for the estimate of $\sigma^2$, let us consider

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

$$E(s^2) = \sigma^2$$

$$V(s^2) = \frac{2\sigma^4}{n - 1}$$

which shows that the minimum variance $2\sigma^4/n$ is not attained. But
this is the minimum attainable, as shown below. If any estimate of
$\sigma^2$ differs from $s^2$ by $f(s, \bar{x})$, then

$$\int f(s, \bar{x}) \exp\left[-\frac{(n-1)s^2 + n(\bar{x} - \mu)^2}{2\sigma^2}\right] dv = 0$$

Twice differentiation with respect to $\mu$ leads to the result

$$E\{(\bar{x} - \mu)^2 f(s, \bar{x})\} = 0$$

Differentiation with respect to $\sigma^2$ gives

$$E\{n(\bar{x} - \mu)^2 f(s, \bar{x}) + (n - 1)s^2 f(s, \bar{x})\} = 0$$

or

$$\operatorname{cov}[s^2, f(s, \bar{x})] = 0$$

Consider

$$V[s^2 + f(s, \bar{x})] = V(s^2) + V[f(s, \bar{x})]$$

which means that $V(s^2)$ is the least possible. Thus $\bar{x}$ and $s^2$
are the best unbiased estimates of $\mu$ and $\sigma^2$.
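
The comparison between $V(s^2)$ and the limit $I^{22} = 2\sigma^4/n$ in the
example above can be reproduced with a few lines of exact arithmetic. The
Python sketch below is an illustration only; the function names and the
values $n = 10$, $\sigma^2 = 2$ are assumptions.

```python
from fractions import Fraction


def normal_information_matrix(n, sigma2):
    """Information matrix for (mu, sigma^2) from n normal observations."""
    s2 = Fraction(sigma2)
    return [[n / s2, Fraction(0)], [Fraction(0), n / (2 * s2**2)]]


def inverse_2x2(m):
    """Reciprocal of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


n, sigma2 = 10, 2
reciprocal = inverse_2x2(normal_information_matrix(n, sigma2))
limit_for_sigma2 = reciprocal[1][1]                      # 2 sigma^4 / n
variance_of_s2 = 2 * Fraction(sigma2) ** 2 / (n - 1)     # 2 sigma^4 / (n - 1)
print(reciprocal[0][0], limit_for_sigma2, variance_of_s2)
print(variance_of_s2 > limit_for_sigma2)                 # the limit is not attained
```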


4a.9 Properties of Distributions Admitting Sufficient 
Statistics: Several Parameters 
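It has been shown by Koopman (1936) that under some general conditions the
distribution function
$\phi(x_1, \dots, x_n \mid \theta_1, \dots, \theta_q)$ admitting a set of
statistics $T_1, \dots, T_q$ sufficient for $\theta_1, \dots, \theta_q$ can
be expressed in the form

$$\phi = \exp[\Theta_1 X_1 + \Theta_2 X_2 + \dots + \Theta_q X_q + \Theta + X]$$

where $X$ depends on $T$ only and $\Theta$ on $\theta$ only. Using the
relation

$$\int \phi\,dv = 1$$

we find on suitable differentiations

$$E(X_i) = -\frac{\partial\Theta}{\partial\Theta_i}$$

$$V(X_i) = -\frac{\partial^2\Theta}{\partial\Theta_i^2}$$

$$\operatorname{cov}(X_i, X_j) = -\frac{\partial^2\Theta}{\partial\Theta_i\,\partial\Theta_j}$$

The element $I_{ij}$ of the information matrix for the functions
$\theta_i' = \Theta_i$, $(i = 1, \dots, q)$, is

$$I_{ij} = E\left(-\frac{\partial^2 \log\phi}{\partial\theta_i'\,\partial\theta_j'}\right) = -\frac{\partial^2\Theta}{\partial\theta_i'\,\partial\theta_j'}$$

If $(I^{ij})$ is the matrix inverse to $(I_{ij})$, then the minimum possible
variance for an estimate of the parameter
$-\partial\Theta/\partial\Theta_i$ is

$$\sum\sum I^{mn}\,\frac{\partial^2\Theta}{\partial\Theta_i\,\partial\Theta_m}\,\frac{\partial^2\Theta}{\partial\Theta_i\,\partial\Theta_n} = \sum\sum I^{mn} I_{im} I_{in} = I_{ii} = V(X_i)$$

so that the minimum variance (4a.8.3) is attainable for the estimates of
$-\partial\Theta/\partial\Theta_i$ $(i = 1, \dots, q)$. This shows that,
when sufficient statistics equal in number to the unknown parameters exist,
it is possible to find functions of parameters which admit estimates with
the minimum variance (4a.8.3).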


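Example. For the example considered in 4a.8, $\phi$ can be expressed as

$$\phi = \text{const.} \exp\left[-\frac{(n-1)s^2 + n(\bar{x} - \mu)^2}{2\sigma^2} - n\log\sigma\right]$$

$$= \text{const.} \exp\left[\frac{n\mu\bar{x}}{\sigma^2} - \frac{(n-1)s^2 + n\bar{x}^2}{2\sigma^2} - \frac{n\mu^2}{2\sigma^2} - n\log\sigma\right]$$

$$\Theta_1 = \frac{n\mu}{\sigma^2}, \qquad \Theta_2 = -\frac{n}{2\sigma^2}, \qquad \Theta = -\left(n\log\sigma + \frac{n\mu^2}{2\sigma^2}\right)$$

$$\frac{\partial\Theta}{\partial\mu} = -\frac{n\mu}{\sigma^2} = \frac{\partial\Theta}{\partial\Theta_1}\frac{\partial\Theta_1}{\partial\mu} + \frac{\partial\Theta}{\partial\Theta_2}\frac{\partial\Theta_2}{\partial\mu} = \frac{\partial\Theta}{\partial\Theta_1}\,\frac{n}{\sigma^2}$$

or

$$-\frac{\partial\Theta}{\partial\Theta_1} = \mu$$

Similarly,

$$-\frac{\partial\Theta}{\partial\Theta_2} = \mu^2 + \sigma^2$$

The parametric functions $\mu$ and $(\mu^2 + \sigma^2)$ admit estimation
with the minimum possible variance. Their estimates are $\bar{x}$ and
$\{(n - 1)s^2 + n\bar{x}^2\}/n$. It is seen in the previous example that
$\bar{x}$ and $s^2$ are the best for $\mu$ and $\sigma^2$. In general, it
can be proved that any function of the sufficient set of statistics has the
minimum attainable variance as an estimate of its expected value.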

This fundamental concept of sufficient statistics is due to R. A. Fisher, 
who recommended, as a first step in any methodological problem, the 
replacement of a sample by an exhaustive set of sufficient statistics. 
It is already known that efficient estimates derived by the method of 
maximum likelihood are functions of sufficient statistics. The author 
has shown (Rao, 1945) that minimum variance estimates must neces- 
sarily be functions of sufficient statistics. The 1945 paper contains the 
bulk of the matter (on limits to variance) treated in 4a. 


4b Estimation by the Method of Maximum Likelihood 


4b.1 The Principle of Maximum Likelihood 


If $\phi$ is the probability density of the observations, then the
likelihood * of the parameters occurring in $\phi$ is defined to be any
function proportional to $\phi$, the constant of proportionality being
independent of the parameters. The principle of maximum likelihood consists
in accepting as the best estimate of the parameters those values of the
parameters which maximize the likelihood for a given set of observations.
The estimates thus obtained from a primitive postulate satisfy some optimum
properties which are considered below.

* In statistical literature the term "likelihood of the observations" is
often wrongly used to mean the probability density of the observations. The
probability density for a given set of observations may be considered as a
function of the parameters, which is otherwise termed the likelihood of the
parameters.

If $T_1, \dots, T_m$ constitute a minimal set of sufficient statistics, then
$\phi$ is of the form

$$\phi = P_1(T_1, \dots, T_m \mid \theta_1, \dots, \theta_q)\,P_2(x_1, \dots, x_n \mid T_1, \dots, T_m)$$

so that maximizing $\phi$ is equivalent to maximizing $P_1(T \mid \theta)$.
The estimates of the parameters are necessarily functions of these
sufficient statistics. This shows that the maximum likelihood estimates
satisfy the necessary condition for possessing the minimum attainable
variance. Under some conditions, when the number of sufficient statistics
is equal to the number of parameters to be estimated, it was shown in
4a.9 that this is a sufficient condition for minimum variance estimates
of suitably chosen parametric functions to exist. The existence of
sufficient statistics equal in number to the parameters to be estimated
is rare, and it is of importance to study what properties the maximum
likelihood estimates obey in general.
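
The dependence of the maximum likelihood estimate on the sufficient
statistic alone can be seen in a small numerical sketch. The Python code
below is an illustration only, using the Poisson density of example 4 in
4a.3; the function names, the two samples and the search grid are
assumptions, and a crude grid search stands in for an analytical
maximization.

```python
import math


def poisson_log_likelihood(mu, xs):
    """log of the likelihood of mu, omitting the factor 1/(x_1! ... x_n!) free of mu."""
    return -len(xs) * mu + sum(xs) * math.log(mu)


def ml_estimate_on_grid(xs, grid):
    """Grid value of mu with the largest log likelihood."""
    return max(grid, key=lambda mu: poisson_log_likelihood(mu, xs))


grid = [k / 100 for k in range(1, 1001)]   # candidate values 0.01, 0.02, ..., 10.00
sample_a = [2, 0, 3, 1, 4]
sample_b = [5, 5, 0, 0, 0]                 # different sample, same total of 10

print(ml_estimate_on_grid(sample_a, grid))   # the sample mean, 2.0
print(ml_estimate_on_grid(sample_b, grid))   # the same value, since the totals agree
```

The two samples give the same likelihood function of $\mu$ apart from a
constant factor, and hence the same estimate.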


4b.2 Consistency and Bias 

A statistic $t_n$, a function of the $n$ observations in the sample, is said
to be a consistent estimate of a parameter $\theta$ if, for any two positive
numbers $\delta$ and $\epsilon$, a number $n_0$ exists such that when $n$
exceeds $n_0$ the probability that

$$|t_n - \theta| > \delta$$

is less than $\epsilon$. This implies that with the increase in $n$, the
sample size, the chance that the difference between the statistic $t_n$ and
the parameter $\theta$ will exceed any given amount decreases. If such
statistics are used, the accuracy of the estimate increases with the
increase in the observations, and ultimately the true value of the parameter
is approached.

The property of consistency can be simply expressed as

$$P\{|t_n - \theta| < \delta\} > 1 - \epsilon$$

or $t_n \to \theta$ stochastically. This must be differentiated from the
mathematical limit where the property $|t_n - \theta| < \delta$ holds
unconditionally when $n > n_0$. In a stochastic limit the statistic $t_n$
can differ from $\theta$ by more than $\delta$ when $n > n_0$, but it does
so with a frequency tending to zero as $n$ becomes large. If $n_0$, so
determined, is independent of $\theta$ in an interval $(a, b)$ the
consistency is said to be uniform in $(a, b)$.


It may be noted that consistency does not imply unbiasedness of the 
statistic for any given sample size; it may not be so even in the limit. 
Since only moderately large sample sizes are met with in practice, it is 
important to calculate the bias in any estimate and correct for it, if 
possible. In some cases the bias may be ignored if it is known to be 
small enough not to invalidate any inference drawn by using the estimate. 

Example 1. Consider the sample of $n$ observations from a normal
population. The m.l. (maximum likelihood) estimate of $\sigma^2$ is
$\sum (x_i - \bar{x})^2/n$ which has the expected value
$(n - 1)\sigma^2/n$, so that there is some underestimation. In such a case
the bias can be corrected by using the estimate
$\sum (x_i - \bar{x})^2/(n - 1)$ whose expectation is exactly $\sigma^2$.
Such a correction is unimportant when the sample is large.

Example 2. Consider $n$ observations from the Cauchy distribution.
It was shown in 2a.7 that the mean of $n$ observations has the same
probability density as that of a single observation. The probability of

$$|\bar{x} - \mu| > \delta$$

where $\delta$ is an assigned quantity remains the same for any $n$ so that
$\bar{x}$ is not consistent for $\mu$.
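Example 3. A statistic $T_n$ such that

$$E(T_n) = \theta_n \to \theta \quad\text{and}\quad V(T_n) \to 0 \quad\text{as } n \to \infty$$

is consistent.
To prove this we shall first prove a lemma due to Tschebyscheff.


Lemma. If $x$ is a stochastic variable such that $E(x) = \mu$ and
$V(x) = \sigma^2$, then

$$P(|x - \mu| > \delta) \le \frac{\sigma^2}{\delta^2}$$

where $\delta$ is any assigned positive quantity.


Proof. If $f(x)$ denotes the probability density of $x$, then

$$\sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx$$

$$\ge \int_{-\infty}^{\mu - \delta} (x - \mu)^2 f(x)\,dx + \int_{\mu + \delta}^{+\infty} (x - \mu)^2 f(x)\,dx$$

$$\ge \delta^2 \left[\int_{-\infty}^{\mu - \delta} f(x)\,dx + \int_{\mu + \delta}^{+\infty} f(x)\,dx\right] = \delta^2\,P(|x - \mu| > \delta)$$

Hence the lemma. Using this lemma in the above problem we have

$$P(|T_n - \theta_n| < \delta) \ge 1 - \frac{V(T_n)}{\delta^2}$$

If $|T_n - \theta_n| < \delta$, then
$|T_n - \theta| < \delta + |\theta - \theta_n|$ and

$$P(|T_n - \theta| < \delta + |\theta - \theta_n|) \ge P(|T_n - \theta_n| < \delta) \ge 1 - \frac{V(T_n)}{\delta^2}$$

Since $\theta_n \to \theta$ and $V(T_n) \to 0$, there exists an $n_0$ such
that for all $n > n_0$

$$|\theta_n - \theta| < \delta_1 \quad\text{and}\quad V(T_n) < \epsilon\delta^2$$

If $|T_n - \theta| < \delta + |\theta - \theta_n|$, then
$|T_n - \theta| < \delta + \delta_1$; therefore for $n > n_0$

$$P(|T_n - \theta| < \delta + \delta_1) \ge P(|T_n - \theta| < \delta + |\theta - \theta_n|) > 1 - \epsilon$$

Since $\delta$ and $\delta_1$ are arbitrary, the result is established. It
may be inferred that the stochastic convergence is uniform if the
mathematical convergence of $E(T_n)$ and $V(T_n)$ is uniform.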
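
The lemma holds equally for discrete variables, with sums replacing
integrals, and this makes an exact check easy. The Python sketch below is an
illustration only; the function name and the choice of a fair six-faced die
with $\delta = 2$ are assumptions.

```python
from fractions import Fraction


def tail_probability_and_bound(values, delta):
    """For equally likely values, return P(|x - mu| > delta) and sigma^2 / delta^2."""
    n = len(values)
    mu = Fraction(sum(values), n)
    variance = sum((v - mu) ** 2 for v in values) / n
    tail = Fraction(sum(1 for v in values if abs(v - mu) > delta), n)
    return tail, variance / delta**2


tail, bound = tail_probability_and_bound([1, 2, 3, 4, 5, 6], 2)
print(tail, bound)       # 1/3 and 35/48
print(tail <= bound)     # the inequality of the lemma
```

The bound is far from sharp here, which is typical: the lemma uses nothing
about the distribution beyond its variance.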
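Example 4. Consider the bivariate normal distribution

$$\text{const.} \exp\left[-\frac{1}{2(1 - \rho^2)}\left\{\frac{(x - \mu_1)^2}{\sigma_1^2} - \frac{2\rho(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2} + \frac{(y - \mu_2)^2}{\sigma_2^2}\right\}\right] dx\,dy$$

The bivariate moment $\mu_{rs}$ is defined by

$$E(x - \mu_1)^r (y - \mu_2)^s = \sigma_1^r\sigma_2^s\,E(\xi^r\eta^s)$$

where

$$E(\xi^r\eta^s) = \text{const.} \iint \xi^r\eta^s \exp\left[-\frac{1}{2(1 - \rho^2)}(\xi^2 - 2\rho\xi\eta + \eta^2)\right] d\xi\,d\eta$$

$$= \text{const.} \int \xi^r e^{-\xi^2/2}\,d\xi \int \eta^s \exp\left[-\frac{(\eta - \rho\xi)^2}{2(1 - \rho^2)}\right] d\eta$$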
= const. [ëe HE dé |a ex | 2- 1 — p m 
= ays + sas T ps7 aa Sega Pp Vs Pr +2 C ida 
= davw + rot pvr—1Ys+ pare (by symmetry) 
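where

$$\gamma_t = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}}\,\xi^t e^{-\xi^2/2}\,d\xi = \begin{cases} 0 & \text{if } t \text{ is odd} \\[1ex] \dfrac{t!}{2^{t/2}\,(t/2)!} & \text{if } t \text{ is even} \end{cases}$$

It is easy to see that the expression for $E(\xi^r\eta^s)$ vanishes whenever
$(r + s)$ is odd. We thus obtain

$$\mu_{20} = \sigma_1^2, \qquad \mu_{02} = \sigma_2^2, \qquad \mu_{11} = \rho\sigma_1\sigma_2$$

$$\mu_{40} = 3\sigma_1^4, \qquad \mu_{04} = 3\sigma_2^4, \qquad \mu_{13} = 3\rho\sigma_1\sigma_2^3$$

$$\mu_{31} = 3\rho\sigma_1^3\sigma_2, \qquad \mu_{22} = (1 + 2\rho^2)\sigma_1^2\sigma_2^2 \qquad\text{and so on}$$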

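The ratio y/x is called an index. To determine the moments of y/x
it can be formally expanded.

$$\frac{y}{x} = \frac{\mu_2}{\mu_1} \left( 1 + \frac{y-\mu_2}{\mu_2} \right) \left( 1 + \frac{x-\mu_1}{\mu_1} \right)^{-1}$$

$$= \frac{\mu_2}{\mu_1} \left\{ 1 + \frac{y-\mu_2}{\mu_2} - \frac{x-\mu_1}{\mu_1} - \frac{(x-\mu_1)(y-\mu_2)}{\mu_1\mu_2} + \frac{(x-\mu_1)^2}{\mu_1^2} + \cdots \right\}$$

Taking expectations of both sides,

$$E\left(\frac{y}{x}\right) = \frac{\mu_2}{\mu_1} \left\{ 1 - \frac{\mu_{11}}{\mu_1\mu_2} + \frac{\mu_{20}}{\mu_1^2} + \cdots \right\}$$

$$= \frac{\mu_2}{\mu_1} \left\{ \sum_{t=0}^{\infty} \frac{(2t)!\,v_1^{2t}}{2^t\,t!} - \rho \sum_{t=0}^{\infty} \frac{(2t+2)!\,v_1^{2t+1} v_2}{2^{t+1}\,(t+1)!} \right\}$$

$$= \frac{\mu_2}{\mu_1} \left\{ 1 + (v_1 - \rho v_2) \sum_{t=1}^{\infty} \frac{(2t)!\,v_1^{2t-1}}{2^t\,t!} \right\}$$

where $v_1$ and $v_2$ are the coefficients of variation of x and y, respectively.
The series on the right-hand side converges only when the coefficient of
variation $v_1$ is small.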


If n pairs of observations on x and y are available, we can construct
two statistics

$$T_1 = \frac{1}{n} \sum \frac{y_i}{x_i} \qquad T_2 = \frac{\bar{y}}{\bar{x}}$$

with

$$E(T_1) = \frac{\mu_2}{\mu_1} \left\{ 1 + (v_1 - \rho v_2) \sum_{t=1}^{\infty} \frac{(2t)!\,v_1^{2t-1}}{2^t\,t!} \right\}$$

$$E(T_2) = \frac{\mu_2}{\mu_1} \left\{ 1 + \frac{1}{\sqrt{n}} (v_1 - \rho v_2) \sum_{t=1}^{\infty} \frac{(2t)!}{2^t\,t!} \left( \frac{v_1}{\sqrt{n}} \right)^{2t-1} \right\}$$

since the coefficient of variation of $\bar{x}$ is $v_1/\sqrt{n}$ and of $\bar{y}$ is $v_2/\sqrt{n}$. It is
seen that $E(T_2) \to \mu_2/\mu_1$ while $E(T_1)$ remains the same for all n, and
since both have variances of O(1/n), $T_2$ converges stochastically to
$\mu_2/\mu_1$ and $T_1$ to some other value.

$T_1$ is a biased estimate of $\mu_2/\mu_1$ and does not admit a simple correction
for bias. Since the bias does not tend to zero as $n \to \infty$, it should
be considered inconsistent as an estimate of the parametric function
$\mu_2/\mu_1$. On the other hand, $T_2$ is a consistent estimate of the ratio.

In biometric work, extensive use is made of indices, and in many
cases the mean index is calculated by taking the ratio of the mean values
of two characters. The estimate so obtained is not strictly comparable
with that obtained by taking the average of all indices. The expected
value of the index may be different from the ratio of the expected values
of the two characters, for which $T_2$ is a good estimate. The index should
be treated as a separate character for evaluation of its constants, mean,
standard deviation, etc. A comparison of the indices involves the comparison
of a function of both the mean values and second-order moments.
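The distinction between the average of the indices and the ratio of the means can be seen on a small set of figures. The sketch below uses three made-up pairs of values (an assumption for illustration only) and also evaluates the leading (t = 1) term of the series given above; the function names are hypothetical.

```python
def mean_of_indices(xs, ys):
    # T1: the average of the individual indices y/x.
    return sum(y / x for x, y in zip(xs, ys)) / len(xs)

def ratio_of_means(xs, ys):
    # T2: the ratio of the mean of y to the mean of x.
    return (sum(ys) / len(ys)) / (sum(xs) / len(xs))

def leading_term_expectation(mu1, mu2, v1, v2, rho, n=1):
    # Only the t = 1 term of the series is kept:
    # n = 1 gives the leading term of E(T1), general n that of E(T2).
    return (mu2 / mu1) * (1.0 + (v1 - rho * v2) * v1 / n)

# Made-up measurements, not taken from the text.
xs = [1.0, 2.0, 4.0]
ys = [2.0, 2.0, 2.0]
t1 = mean_of_indices(xs, ys)   # 7/6
t2 = ratio_of_means(xs, ys)    # 6/7
```

With coefficients of variation of 10 per cent and a correlation of 0.5, the leading term puts the expectation of a single index about 0.5 per cent above the ratio of the expected values, while for the ratio of means of 100 pairs the excess is a hundredth of that.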
4b.3 The Concept of Efficiency 

Of all statistics which converge stochastically to a parameter $\theta$, the
practically important ones are those that converge rapidly. Such
statistics give large deviations from the true value less frequently, at
least in large samples, thus satisfying the requirements of a good estimate
considered in 4b.2. We need, then, a criterion for judging which
of two statistics converges more rapidly. If only statistics which are
asymptotically normally distributed are considered, then the rapidity
of convergence can be measured by the variance or the reciprocal of variance.
In a normal distribution the probability of a departure from the mean
exceeding $\lambda$ times the standard deviation is a decreasing function of $\lambda$,
so that the probability of a departure's exceeding a given value
decreases with decrease in the variance. Therefore that statistic
with the smallest asymptotic variance is preferred and is called the efficient
estimate. The efficiency of any other estimate can be measured
by the ratio of variance of the efficient estimate to that of the other.

Efficiency, though linked with minimum variance, is only a
large sample concept. Statistics with minimum variances considered in
4a are, no doubt, the most efficient ones in large samples, in which case
variance acquires a special significance. But in small samples there is
not sufficient justification for using variance as the criterion for selecting
a good estimate.

It may be observed that the comparison is confined only to a special class of
statistics which are asymptotically normally distributed. This is not a
serious objection in view of the fact that a large class of statistics obeys
this property.

The properties of minimum variance estimates considered in examples
of 4a.4 hold good for efficient estimates also.

Example 1. Asymptotic distribution of quantiles. Let f(x) dx be
the probability differential of x, and define

$$F(x) = \int_{-\infty}^{x} f(x)\,dx$$

If x is the quantile of order p, (0 < p < 1), in a series of n observations,
then $\mu = [np]$ observations are less than x, and $(n - \mu - 1)$ are greater
than x + dx; the remaining value falls between x and x + dx. Hence
the probability differential of x is

$$\frac{n!}{\mu!\,(n-\mu-1)!}\,[F(x)]^{\mu}\,[1 - F(x)]^{n-\mu-1} f(x)\,dx$$

Let y = F(x) so that dy = f(x) dx. The distribution of y is

$$\frac{n!}{\mu!\,(n-\mu-1)!}\,y^{\mu} (1-y)^{n-\mu-1}\,dy$$

Let $z = \sqrt{n}\,(y - p)/\sqrt{pq}$, where q = 1 - p; then the distribution of z becomes

$$\text{const.} \left( p + z\sqrt{\frac{pq}{n}} \right)^{\mu} \left( q - z\sqrt{\frac{pq}{n}} \right)^{n-\mu-1} dz = \text{const.}\ e^{\psi}\,dz$$

where

$$\psi = \mu \log\left( 1 + z\sqrt{\frac{q}{np}} \right) + (n - \mu - 1) \log\left( 1 - z\sqrt{\frac{p}{nq}} \right) \to -\frac{z^2}{2}$$

which shows that y is asymptotically normal with p as mean and pq/n
as variance. Since y = F(x), assuming that the inverse function exists,
we can write

$$x = F^{-1}(y) = \phi(y)$$

or, expanding at y = p,

$$x = \phi(p) + (y - p)\,\phi'(p) + \frac{(y-p)^2}{2!}\,\phi''(p) + \cdots$$

Neglecting terms of the order $(y - p)^2$, we find that x is distributed
normally in large samples with mean $\phi(p)$ and variance $[\phi'(p)]^2 V(y)$
(see 5e.1). The value of $\phi(p) = \zeta$ is determined from the relation

$$p = \int_{-\infty}^{\zeta} f(x)\,dx$$

Also, since

$$\phi'(p) = \frac{1}{F'(\zeta)} = \frac{1}{f(\zeta)}$$

$$V(x) = [\phi'(p)]^2 V(y) = \frac{pq}{n\,[f(\zeta)]^2}$$

It must be noted that convergence to normality is rapid only for quantiles
of order p near about 1/2. For lower or higher quantiles normality
is realized only in very large samples.

Example 2. Efficiency of the median as an estimate of the mean of a
normal population. For a normal population the average of the observations
is the most efficient estimate of the mean, and its variance is $\sigma^2/n$.
The asymptotic variance of the median is

$$\frac{1}{4n\,[f(\zeta)]^2}$$

where $f(\zeta)$, the ordinate at the mean of the normal distribution, is equal
to $1/(\sigma\sqrt{2\pi})$. The variance of the median is

$$\frac{\pi\sigma^2}{2n}$$

so that its efficiency is $2/\pi$ or 63.7%.
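The two examples above can be put into a few lines of arithmetic. The following sketch, which relies on the `NormalDist` class of the Python standard library, evaluates the asymptotic variance pq/(n f(ζ)²) of a quantile and from it the efficiency of the median; the function names and the default arguments are assumptions made for the illustration.

```python
import math
from statistics import NormalDist

def quantile_asymptotic_variance(dist, p, n):
    # Large-sample variance p*q / (n * f(zeta)^2) of the quantile of order p,
    # where zeta is the population quantile and f the density.
    zeta = dist.inv_cdf(p)
    return p * (1.0 - p) / (n * dist.pdf(zeta) ** 2)

def median_efficiency(sigma=1.0, n=100):
    # Ratio of the variance of the mean, sigma^2 / n, to that of the median.
    dist = NormalDist(0.0, sigma)
    return (sigma ** 2 / n) / quantile_asymptotic_variance(dist, 0.5, n)
```

The ratio does not depend on the standard deviation or on n, and `median_efficiency()` agrees with 2/π, that is about 0.637.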
4b.4 Some Optimum Properties of Maximum Likelihood 


Let $f(x, \theta)$ be the probability density of the distribution from which a sample of
size n, $x_1, \cdots, x_n$, is available. We shall denote $f(x_i, \theta)$ by $f_i$ and the
likelihood by L. The following assumptions are made.

(i) The derivatives $\partial \log L/\partial\theta$ and $\partial^2 \log L/\partial\theta^2$ exist and are continuous
for every $\theta$ in a range R, including the true value, and for almost
all x. For every $\theta$ in R,

$$\left| \frac{\partial L}{\partial\theta} \right| < F_1(x) \qquad \left| \frac{\partial^2 L}{\partial\theta^2} \right| < F_2(x)$$

where $F_1$ and $F_2$ are integrable functions over $(-\infty, +\infty)$.

(ii) The derivative $\partial^3 \log L/\partial\theta^3$ exists and is such that

$$\left| \frac{\partial^3 \log L}{\partial\theta^3} \right| < M(x) \qquad E\{M(x)\} < K \ \text{(a positive quantity)}$$

(iii) For every $\theta$ in R,

$$\int_{-\infty}^{+\infty} \left( \frac{\partial \log L}{\partial\theta} \right)^2 L\,dx = I(\theta)$$

is finite and non-zero.

(iv) The range of integration is independent of $\theta$.

Under these conditions the following theorems will be proved.


Theorem 1. With probability approaching unity as $n \to \infty$, the
likelihood equation $\partial \log L/\partial\theta = 0$ has a solution which converges in
probability to the true value $\theta_0$. (Dugué, 1937.)

Theorem 2. Any consistent solution of the likelihood equation provides
a maximum of the likelihood with probability tending to unity as
the sample size tends to infinity. (Huzurbazar, 1948.)

Theorem 3. A consistent solution of the likelihood equation is asymptotically
normally distributed about the true value $\theta_0$. (Cramér, 1946.)


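Some of the limiting theorems used in this connection are given without
proof in an appendix at the end of this chapter. Under the conditions
assumed we have, following Cramér (1946),

$$\frac{\partial \log L}{\partial\theta} = \left( \frac{\partial \log L}{\partial\theta} \right)_{\theta_0} + (\theta - \theta_0) \left( \frac{\partial^2 \log L}{\partial\theta^2} \right)_{\theta_0} + \frac{(\theta - \theta_0)^2}{2} \left( \frac{\partial^3 \log L}{\partial\theta^3} \right)_{\theta'}$$

where $\theta'$ lies in $(\theta, \theta_0)$. Dividing both sides by n, we have

$$\frac{1}{n} \frac{\partial \log L}{\partial\theta} = B_0 + B_1(\theta - \theta_0) + \frac{B_2}{2} (\theta - \theta_0)^2$$

where

$$B_0 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\partial \log f_i}{\partial\theta} \right)_{\theta_0} \qquad E(B_0) = 0$$

$$B_1 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\partial^2 \log f_i}{\partial\theta^2} \right)_{\theta_0} \qquad E(B_1) = -\frac{I(\theta_0)}{n}$$

$B_2$ is the third derivative $(\partial^3 \log L/\partial\theta^3)_{\theta'}$ divided by n, and $I(\theta)$ is the information

$$E\left( \frac{\partial \log L}{\partial\theta} \right)^2$$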

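By Kintchine's theorem the quantities $B_0$, $B_1$, and $B_2$ stochastically
converge to their mean values. Given two quantities $\delta$ and $\epsilon$, it is possible
to find $n > n_0(\delta, \epsilon)$ such that

$$P_1 = P\{ |B_0| \ge \delta^2 \} < \frac{\epsilon}{3}$$

$$P_2 = P\left\{ B_1 \ge -\frac{1}{2} \frac{I(\theta_0)}{n} \right\} < \frac{\epsilon}{3}$$

$$P_3 = P\{ |B_2| \ge 2K \} < \frac{\epsilon}{3}$$

The probability that the sample point is such that the inequalities

$$|B_0| < \delta^2 \qquad B_1 < -\frac{1}{2} \frac{I(\theta_0)}{n} \qquad |B_2| < 2K$$

are simultaneously satisfied is evidently greater than $1 - P_1 - P_2 - P_3$
$(= 1 - \epsilon)$. Let S denote the set of such points.
For $\theta = \theta_0 \pm \delta$

$$\frac{1}{n} \frac{\partial \log L}{\partial\theta} = B_0 \pm \delta B_1 + \frac{1}{2} \delta^2 B_2$$

For every point in S, $B_0 + \frac{1}{2} B_2 \delta^2 < (K + 1)\delta^2$, and $B_1 \delta < -I(\theta_0)\delta/2n$.
If $\delta < I(\theta_0)/2n(K + 1)$, the sign of the whole expression
for $\theta = \theta_0 \pm \delta$ will be determined by $\pm B_1 \delta$ so that $(\partial \log L/\partial\theta) > 0$ for
$\theta = \theta_0 - \delta$ and $(\partial \log L/\partial\theta) < 0$ for $\theta = \theta_0 + \delta$. Since $\partial \log L/\partial\theta$ is a
continuous function of $\theta$ in R, for almost all x, it follows that the likelihood
equation has a root in $\theta_0 \pm \delta$ with probability tending to unity.
This establishes theorem 1.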
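Let $\hat\theta$ be a consistent solution of the likelihood equation so that

$$P\{ |\hat\theta - \theta_0| < \eta \} \to 1 \quad\text{as } n \to \infty$$

By the mean value theorem

$$\frac{1}{n} \left( \frac{\partial^2 \log L}{\partial\theta^2} \right)_{\hat\theta} - \frac{1}{n} \left( \frac{\partial^2 \log L}{\partial\theta^2} \right)_{\theta_0} = \frac{1}{n} \left( \frac{\partial^3 \log L}{\partial\theta^3} \right)_{\theta'} (\hat\theta - \theta_0)$$

Hence the modulus of the left-hand side is less than $K\eta$ with probability
tending to unity, which means that $(1/n)(\partial^2 \log L/\partial\theta^2)_{\hat\theta}$ converges in
probability to $(1/n)(\partial^2 \log L/\partial\theta^2)_{\theta_0}$ which tends to $-i(\theta_0)$, where $i(\theta_0)$
is the information per single observation. Therefore, for any arbitrarily
small $\epsilon'$ we have

$$P\left\{ \frac{1}{n} \left( \frac{\partial^2 \log L}{\partial\theta^2} \right)_{\hat\theta} < -i(\theta_0) + \epsilon' \right\} \to 1 \quad\text{as } n \to \infty$$

The quantity $-i(\theta_0)$ is fixed and negative and for small $\epsilon'$, $-i(\theta_0) + \epsilon'$
is also negative. Therefore

$$P\left\{ \left( \frac{\partial^2 \log L}{\partial\theta^2} \right)_{\hat\theta} < 0 \right\} \to 1 \quad\text{as } n \to \infty$$

This shows that the probability that the likelihood is a maximum at $\hat\theta$
approaches certainty as n tends to $\infty$, thus establishing theorem 2.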

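If $\hat\theta_1$ and $\hat\theta_2$ are two consistent solutions of the likelihood equation, it
follows from Rolle's theorem that $\partial^2 \log L/\partial\theta^2 = 0$ has at least one solution
$\hat\theta_3$ lying between $\hat\theta_1$ and $\hat\theta_2$. Since $\hat\theta_1$ and $\hat\theta_2$ converge to $\theta_0$ in probability,
$\hat\theta_3$ also does. Therefore

$$P\left\{ \left( \frac{\partial^2 \log L}{\partial\theta^2} \right)_{\hat\theta_3} < 0 \right\} \to 1 \quad\text{as } n \to \infty$$

which is a contradiction because $(\partial^2 \log L/\partial\theta^2)_{\hat\theta_3} = 0$. We thus obtain
the following corollary.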


Corollary. A consistent solution of the likelihood equation is unique. 


Wald (1949) has recently proved that the solution of the likelihood 
equation which makes the likelihood function an absolute maximum is 
necessarily consistent. This is more powerful than theorem 2 which 
says that the likelihood has a relative maximum at the unique con- 
sistent solution of the likelihood equation. This proof is not given here. 
Let the consistent solution of the likelihood equation be denoted by
$\hat\theta$. We have

$$\sqrt{I(\theta_0)}\,(\hat\theta - \theta_0) = \frac{\dfrac{1}{\sqrt{I(\theta_0)}} \displaystyle\sum_{i=1}^{n} \left( \dfrac{\partial \log f_i}{\partial\theta} \right)_{\theta_0}}{-\dfrac{nB_1}{I(\theta_0)} - \dfrac{nB_2(\hat\theta - \theta_0)}{2I(\theta_0)}}$$

The denominator of the right-hand fraction converges to 1 in probability.
By the Lindeberg-Levy theorem the sum $\sum_{1}^{n} (\partial \log f_i/\partial\theta)_{\theta_0}$ is asymptotically
normal with zero mean and variance $I(\theta_0)$. Therefore the
numerator is asymptotically normal with zero mean and unit variance.
Finally, it follows from the general convergence theorem in the appendix that
the ratio $\sqrt{I(\theta_0)}\,(\hat\theta - \theta_0)$ is asymptotically normal with zero mean and
unit variance. This means that $\hat\theta$ is distributed about $\theta_0$ with the variance
$1/I(\theta_0)$. Since this is the minimum possible variance, we have the
following result, when $I(\theta)$ and $V(\hat\theta)$ exist.

Corollary. The consistent solution of the maximum likelihood equation
is fully efficient.


4c Some Examples of Maximum Likelihood Estimates 


4c.1 Improved Estimates of Means from Incomplete Data on 
Several Variables 

Suppose that each individual of a population is characterized by two
measurements but in a sample only one measurement is recorded in
some cases. Thus, out of a total of $N = (n_1 + n_2 + n)$ individuals
observed, $n_1$ of them may provide the first measurement alone, $n_2$ the
second alone, and n both the measurements. The characteristics of,
say, the first measurement, such as the mean, scatter, etc., can be estimated
from the available set of $(n_1 + n)$ observations on the first
measurement alone. If the two measurements are correlated, it is
possible that the observations on the second measurement may throw
some more information on the characteristics of the first. If this is so,
the estimates of the characteristics of any particular measurement obtained
by taking all the data into account will be more accurate than
those obtained from the available set of observations on that particular
measurement alone.

Assuming normal distributions for the measurements, the probability
density of the observations can be written

$$\text{const.}\ \exp -\frac{1}{2} \left[ \frac{\Sigma_1 (x-\mu_1)^2}{\sigma_1^2} + \frac{\Sigma_2 (y-\mu_2)^2}{\sigma_2^2} + \frac{1}{1-\rho^2} \left\{ \frac{\Sigma_3 (x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho\,\Sigma_3 (x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{\Sigma_3 (y-\mu_2)^2}{\sigma_2^2} \right\} \right]$$

where x and y are the two measurements, $\mu_1$ and $\mu_2$ their mean values,
$\sigma_1$ and $\sigma_2$ the standard deviations and $\rho$ the correlation coefficient
between x and y, $\Sigma_1$ the summation over the $n_1$ observations, $\Sigma_2$ the
summation over the $n_2$ observations, and $\Sigma_3$ the summation over the
common set of n observations.


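Equating the derivatives of the logarithm of the likelihood with
respect to $\mu_1$ and $\mu_2$ to zero,

$$S_1 = \frac{\Sigma_1 (x-\mu_1)}{\sigma_1^2} + \frac{1}{1-\rho^2} \left[ \frac{\Sigma_3 (x-\mu_1)}{\sigma_1^2} - \frac{\rho\,\Sigma_3 (y-\mu_2)}{\sigma_1\sigma_2} \right] = 0$$

$$S_2 = \frac{\Sigma_2 (y-\mu_2)}{\sigma_2^2} + \frac{1}{1-\rho^2} \left[ \frac{\Sigma_3 (y-\mu_2)}{\sigma_2^2} - \frac{\rho\,\Sigma_3 (x-\mu_1)}{\sigma_1\sigma_2} \right] = 0$$

or

$$\mu_1 \left\{ \frac{n_1}{\sigma_1} + \frac{n}{\sigma_1(1-\rho^2)} \right\} - \mu_2 \frac{\rho n}{\sigma_2(1-\rho^2)} = \frac{\Sigma_1 x}{\sigma_1} + \frac{1}{1-\rho^2} \left[ \frac{\Sigma_3 x}{\sigma_1} - \rho \frac{\Sigma_3 y}{\sigma_2} \right]$$

$$-\mu_1 \frac{\rho n}{\sigma_1(1-\rho^2)} + \mu_2 \left\{ \frac{n_2}{\sigma_2} + \frac{n}{\sigma_2(1-\rho^2)} \right\} = \frac{\Sigma_2 y}{\sigma_2} + \frac{1}{1-\rho^2} \left[ \frac{\Sigma_3 y}{\sigma_2} - \rho \frac{\Sigma_3 x}{\sigma_1} \right]$$

These are simultaneous equations giving the estimates of $\mu_1$ and $\mu_2$
when $\sigma_1$, $\sigma_2$, and $\rho$ are known. The equations giving the estimates of
$\sigma_1$, $\sigma_2$, and $\rho$ are complicated, but the following estimates may be safely
used when the sample sizes are moderately large.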


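$$\hat\sigma_1^2 = \left[ \Sigma_1 x^2 + \Sigma_3 x^2 - \frac{(\Sigma_1 x + \Sigma_3 x)^2}{n_1 + n} \right] \div (n_1 + n - 1)$$

$$\hat\sigma_2^2 = \left[ \Sigma_2 y^2 + \Sigma_3 y^2 - \frac{(\Sigma_2 y + \Sigma_3 y)^2}{n_2 + n} \right] \div (n_2 + n - 1)$$

$$\hat\rho = \left[ \Sigma_3 xy - \frac{(\Sigma_3 x)(\Sigma_3 y)}{n} \right] \div \left\{ \left[ \Sigma_3 x^2 - \frac{(\Sigma_3 x)^2}{n} \right] \left[ \Sigma_3 y^2 - \frac{(\Sigma_3 y)^2}{n} \right] \right\}^{1/2}$$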


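To obtain the measures of accuracy of the estimates of $\mu_1$ and $\mu_2$ the
information matrix is derived.

$$(I_{ij}) = \begin{pmatrix} E(S_1^2) & E(S_1 S_2) \\ E(S_1 S_2) & E(S_2^2) \end{pmatrix} = \begin{pmatrix} \dfrac{n_1}{\sigma_1^2} + \dfrac{n}{\sigma_1^2(1-\rho^2)} & \dfrac{-n\rho}{\sigma_1\sigma_2(1-\rho^2)} \\ \dfrac{-n\rho}{\sigma_1\sigma_2(1-\rho^2)} & \dfrac{n_2}{\sigma_2^2} + \dfrac{n}{\sigma_2^2(1-\rho^2)} \end{pmatrix}$$

$$(I^{ij}) = \frac{1}{\Delta} \begin{pmatrix} [n_2(1-\rho^2) + n]\,\sigma_1^2 & n\rho\,\sigma_1\sigma_2 \\ n\rho\,\sigma_1\sigma_2 & [n_1(1-\rho^2) + n]\,\sigma_2^2 \end{pmatrix}$$

where $\Delta = (n + n_1)(n + n_2) - \rho^2 n_1 n_2$. If $m_1$ and $m_2$ are the estimates
of $\mu_1$ and $\mu_2$, then

$$V(m_1) = \frac{n_2(1-\rho^2) + n}{(n + n_1)(n + n_2) - \rho^2 n_1 n_2}\,\sigma_1^2$$

$$V(m_2) = \frac{n_1(1-\rho^2) + n}{(n + n_1)(n + n_2) - \rho^2 n_1 n_2}\,\sigma_2^2$$

The estimate of $\mu_1$ obtained from the mean of $(n + n_1)$ observations
on the first measurement has the variance $\sigma_1^2/(n + n_1)$. The
efficiency of this estimate is

$$\frac{(n + n_1)\,[n_2(1-\rho^2) + n]}{(n + n_1)(n + n_2) - \rho^2 n_1 n_2} = \left\{ 1 - \frac{\rho^2 n_2}{n + n_2} \right\} \div \left\{ 1 - \frac{\rho^2 n_1 n_2}{(n + n_1)(n + n_2)} \right\}$$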
Of 188 skeletons of Anglo-Saxons (Münter, 1936) 103 provided maximum
lengths of both the right and left femora, 48 of the right femur
alone, and 37 of the left. The 151 observations on the right femur gave
the estimates

Mean = 463.3     $s_1$ = 22.4 *

the 140 observations on the left femur gave the estimates

Mean = 465.7     $s_2$ = 24.4

The 103 pairs of observations gave the estimated correlation r = 0.9835.
The equations for the combined estimates $m_1$ and $m_2$ of $\mu_1$ and $\mu_2$ are

$$m_1 \left\{ \frac{n_1}{s_1} + \frac{n}{s_1(1-r^2)} \right\} - m_2 \frac{rn}{s_2(1-r^2)} = \frac{\Sigma_1 x}{s_1} + \frac{1}{1-r^2} \left[ \frac{\Sigma_3 x}{s_1} - r \frac{\Sigma_3 y}{s_2} \right]$$

$$-m_1 \frac{rn}{s_1(1-r^2)} + m_2 \left\{ \frac{n_2}{s_2} + \frac{n}{s_2(1-r^2)} \right\} = \frac{\Sigma_2 y}{s_2} + \frac{1}{1-r^2} \left[ \frac{\Sigma_3 y}{s_2} - r \frac{\Sigma_3 x}{s_1} \right]$$

In this example

$$\frac{n_1}{s_1} = \frac{48}{22.4} \qquad \frac{n_2}{s_2} = \frac{37}{24.4}$$

$$\frac{n}{s_1(1-r^2)} = \frac{103}{22.4(1 - 0.9835^2)} = 140.4989$$

$$\frac{n}{s_2(1-r^2)} = \frac{103}{24.4(1 - 0.9835^2)} = 128.9826$$

* The estimates of $\sigma_1$, $\sigma_2$, and $\rho$ are represented by $s_1$, $s_2$, and r, respectively.


$\Sigma_1 x$ = 21851.4 (The sum of observations on 48 maximum lengths of
right femora)

$\Sigma_2 y$ = 16999.3 (The sum of observations on 37 maximum lengths of
left femora)

$\Sigma_3 x$ = 48096.9 (The sum of 103 observations on the right femur
from the common set)

$\Sigma_3 y$ = 48198.7 (The sum of 103 observations on the left femur from
the common set)

With these values the equations can be written

$$142.6418\,m_1 - 126.8544\,m_2 = 7221.5778$$

$$-138.1807\,m_1 + 130.4990\,m_2 = -3470.9599$$

The solutions are

$$m_1 = 462.43 \qquad m_2 = 463.06$$

The variance of $m_1$ is

$$\frac{n_2(1 - r^2) + n}{(n + n_1)(n + n_2) - r^2 n_1 n_2}\,s_1^2 = \frac{104.21093}{19422.1245}\,s_1^2 = 0.0053655\,s_1^2$$

The standard error is

$$\sqrt{0.0053655}\;s_1 = 0.07324 \times 22.4 = 1.64$$
The standard error of the average of 151 observations is

$$\frac{22.4}{\sqrt{151}} = 1.82$$

which is greater than the standard error of the maximum likelihood
estimate $m_1$, which is as efficient as the mean of about 1/0.0053655
= 186 observations. Similarly, the standard error of $m_2$ = 1.79 and is
as efficient as a mean based on about 185 observations. There is, however,
some loss of efficiency due to errors in the estimates of $\sigma_1$, $\sigma_2$, and
$\rho$, but this is very small.

This technique can be employed in many situations. For instance,
the maximum length including spine cannot be obtained for all femora
since the spine is usually broken. If the maximum lengths including
and excluding spine are available for some femora, and only excluding
spine for others, the best estimates of the means of both the measurements
can be obtained by following the above procedure. If there are
no observations for the maximum length including spine alone, the
quantity $n_1$ is equated to zero in the above equations. It may happen
as in 3f.2 that the skeletons providing measurements on both the right
and left femora may be undersized. Using this portion of the material
two prediction formulae can be derived, one for predicting the length
of the left femur given that of the right, and another for predicting the
length of the right femur given that of the left. With the help of the first
prediction formula the average length of the left femur can be estimated
for those skeletons providing measurement on the right femur only.
Let this be $\bar{l}_1$, based on $n_1$ right femur measurements. The direct averages
of the length of the left femur for the n skeletons providing both the
measurements and $n_2$ skeletons only the left femur length are denoted
by $\bar{l}$ and $\bar{l}_2$. If all three types of material are random samples from the
original skeletal population, then the three estimates $\bar{l}_1$, $\bar{l}$, $\bar{l}_2$ should
agree, in which case the estimate obtained by the method of maximum
likelihood is the best. If they do not agree, then the estimate of the
mean left femur length of the skeletal population may be obtained as

$$\frac{n_1 \bar{l}_1 + n \bar{l} + n_2 \bar{l}_2}{n_1 + n + n_2}$$

Similarly the mean length of the right femur can be estimated.
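The arithmetic of the femur example can be retraced as follows. This is a sketch only: it takes the estimates s1, s2 and r as if they were the known values of the standard deviations and the correlation, solves the two simultaneous equations by determinants, and evaluates the standard errors from the inverse of the information matrix. The function and argument names are invented for the illustration.

```python
import math

def combined_means(n, n1, n2, s1, s2, r, sum1_x, sum2_y, sum3_x, sum3_y):
    k = 1.0 - r * r
    # Coefficients of m1 and m2 in the two likelihood equations.
    a11 = n1 / s1 + n / (s1 * k)
    a12 = -r * n / (s2 * k)
    a21 = -r * n / (s1 * k)
    a22 = n2 / s2 + n / (s2 * k)
    # Right-hand sides of the two equations.
    b1 = sum1_x / s1 + (sum3_x / s1 - r * sum3_y / s2) / k
    b2 = sum2_y / s2 + (sum3_y / s2 - r * sum3_x / s1) / k
    det = a11 * a22 - a12 * a21
    m1 = (b1 * a22 - a12 * b2) / det
    m2 = (a11 * b2 - a21 * b1) / det
    # Standard errors from the diagonal of the inverse information matrix.
    delta = (n + n1) * (n + n2) - r * r * n1 * n2
    se1 = s1 * math.sqrt((n2 * k + n) / delta)
    se2 = s2 * math.sqrt((n1 * k + n) / delta)
    return m1, m2, se1, se2

m1, m2, se1, se2 = combined_means(
    n=103, n1=48, n2=37, s1=22.4, s2=24.4, r=0.9835,
    sum1_x=21851.4, sum2_y=16999.3, sum3_x=48096.9, sum3_y=48198.7)
```

With the figures quoted in the text the function reproduces the combined estimates to the second decimal place and gives standard errors of about 1.64 and 1.79, against about 1.82 for the plain average of the 151 right femur lengths.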


4c.2 The Method of Scoring for the Estimation of Parameters 

The maximum likelihood equations are usually complicated so that 
the solutions cannot be obtained directly. A general method in such 
cases is to assume a trial solution and derive linear equations for small 
additive corrections. The process can be repeated till the corrections 
become negligible. A great mechanization is introduced by adopting
the method known as the scoring system for obtaining the linear relations
connecting the additive corrections.

The quantity $d(\log L)/d\theta$, where L is the likelihood of the parameter
$\theta$, is defined as the efficient score for $\theta$. The maximum likelihood estimate
of $\theta$ is that value of $\theta$ for which the efficient score vanishes. If $\theta_0$
is the trial value of the estimate, then expanding $d(\log L)/d\theta$ and retaining
only the first power of $\delta\theta = \theta - \theta_0$,

$$\frac{d \log L}{d\theta} = \frac{d \log L}{d\theta_0} + \delta\theta\,\frac{d^2 \log L}{d\theta_0^2}$$

$$\approx \frac{d \log L}{d\theta_0} - \delta\theta\,I(\theta_0)$$

where $I(\theta_0)$, the information at the value $\theta = \theta_0$, is the expected value
of $-d^2(\log L)/d\theta_0^2$. In large samples the difference between $-I(\theta_0)$ and
$d^2(\log L)/d\theta_0^2$ will be of O(1/n), where n is the number of observations,
so that the above approximation holds to the first order of small quantities.
The correction $\delta\theta$ is obtained from the equation

$$\delta\theta\,I(\theta_0) = \frac{d \log L}{d\theta_0}$$

$$\delta\theta = \frac{d \log L}{d\theta_0} \div I(\theta_0)$$

The first approximation is $(\theta_0 + \delta\theta)$, and the above process can be repeated
with this as the new trial value.
Example 1. Consider a sample of size n from the Cauchy distribution.

$$\frac{1}{\pi} \frac{dx}{1 + (x - \theta)^2}$$

The likelihood equation is

$$\frac{d \log L}{d\theta} = \sum \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0$$

The efficient score for any $\theta$ is

$$S(\theta) = \sum \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2}$$

Information for a single observation is 1/2 so that the asymptotic variance
is 2/n and the additive correction to a trial value $\theta_0$ is

$$\frac{2S(\theta_0)}{n}$$

This process can be continued until a stable value is attained. Fortunately
in the above example $I(\theta)$ happened to be independent of $\theta$ so
that $I(\theta)$ need not be calculated at each trial value.
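The repeated correction for the Cauchy distribution takes only a few lines. The sketch below is one possible arrangement, not a prescribed routine: the stopping tolerance, the limit on the number of rounds and the made-up sample are assumptions, and the trial value must be chosen reasonably close to the solution (the sample median is a common choice) because the likelihood equation may have several roots.

```python
def cauchy_score(theta, xs):
    # Efficient score S(theta) = sum of 2(x - theta) / (1 + (x - theta)^2).
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

def cauchy_mle_by_scoring(xs, theta0, tol=1e-10, max_iter=200):
    # The information per observation is 1/2, so each additive
    # correction is S(theta) / (n / 2) = 2 S(theta) / n.
    n = len(xs)
    theta = theta0
    for _ in range(max_iter):
        step = 2.0 * cauchy_score(theta, xs) / n
        theta += step
        if abs(step) < tol:
            break
    return theta

# Made-up symmetric sample; by symmetry the score vanishes at zero.
sample = [-1.0, 0.0, 1.0]
estimate = cauchy_mle_by_scoring(sample, theta0=0.3)
```

For this sample the successive corrections alternate in sign and shrink by a factor of about three at each round, so a stable value is reached after some twenty rounds.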

Example 2. Score and information for grouped data. 

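Let $\pi_1, \pi_2, \cdots, \pi_k$, $(\Sigma\pi_i = 1)$, be the probabilities in k mutually
exclusive classes, and suppose that $\pi_i = \phi_i(\theta)$ so that all the proportions
$\pi_i$ are defined as functions of a single parameter $\theta$. If $f_1, f_2, \cdots, f_k$ are
the observed frequencies, then

$$\log L = f_1 \log \pi_1 + \cdots + f_k \log \pi_k$$

The score at $\theta$ is

$$\frac{\partial \log L}{\partial\theta} = \frac{f_1}{\pi_1} \frac{\partial\pi_1}{\partial\theta} + \cdots + \frac{f_k}{\pi_k} \frac{\partial\pi_k}{\partial\theta}$$

Information is the variance of $\partial \log L/\partial\theta$ which is a linear function of
the frequencies. Hence by (2a.9.1)

$$I(\theta) = f \sum_{i=1}^{k} \frac{1}{\pi_i} \left( \frac{\partial\pi_i}{\partial\theta} \right)^2 \qquad f = \Sigma f_i$$

The quantities

$$\frac{1}{\pi_i} \frac{\partial\pi_i}{\partial\theta} \quad\text{and}\quad \frac{1}{\pi_i} \left( \frac{\partial\pi_i}{\partial\theta} \right)^2$$

which may be called the score and information supplied by the ith class,
are to be derived in any particular problem before proceeding with the
problem of estimation.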
If two factors are linked with a recombination fraction p, then the
intercrosses AB/ab X AB/ab (coupling) and Ab/aB X Ab/aB (repulsion)
give rise to the following expected proportions and information as
given by Mather (1938).

                      Coupling                            Repulsion
Class     Probability     Score                 Probability     Score
          4π              (1/π)(dπ/dp)          4π              (1/π)(dπ/dp)

AB        3 - 2p + p²     -2(1 - p)/(3 - 2p + p²)     2 + p²     2p/(2 + p²)
Ab        2p - p²         2(1 - p)/(2p - p²)          1 - p²     -2p/(1 - p²)
aB        2p - p²         2(1 - p)/(2p - p²)          1 - p²     -2p/(1 - p²)
ab        1 - 2p + p²     -2(1 - p)/(1 - 2p + p²)     p²         2/p

Information:   coupling    2(3 - 4p + 2p²) / [p(2 - p)(3 - 2p + p²)]
               repulsion   2(1 + 2p²) / [(2 + p²)(1 - p²)]

The amount of information can be used to judge the relative efficiency
of one type of cross with respect to the other for the estimation of the
recombination fraction. For instance, if p = 1/4, the amounts of information
for coupling and repulsion are 3.7909 and 1.1636, respectively.
This means that, using intercrosses with repulsion, the number of offsprings
needed will be about three times as great as that for coupling to estimate
the recombination fraction with the same precision.

Consider coupling data with values for AB, Ab, aB, and ab as shown
below. With the trial value p = 0.21, the score and information are
obtained.

Class     4π         4(dπ/dp)     (1/π)(dπ/dp)     Observed frequency
AB        2.6241     -1.58        -0.60211         125
Ab        0.3759      1.58         4.20325          18
aB        0.3759      1.58         4.20325          20
ab        0.6241     -1.58        -2.53164          34
Absolute sum                      11.54025         197

Information per observation = (1.58/4)(11.54025) = 4.55840

Efficient score = -0.60211(125) + 4.20325(18 + 20) - 2.53164(34)
                = -1.61601

Correction term = -1.61601 / [197(4.55840)] = -0.0018

Second approximation = 0.21 - 0.0018 = 0.2082

The correction is small so that the process need not be repeated. The
variance of the estimate is given by

1 / [197(4.55840)] = 0.00111357

A better estimate of the variance is the reciprocal of the information
at the value p = 0.2082.
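One round of the scoring calculation for the coupling data can be retraced as below. The sketch takes the expected proportions for coupling from the table of Mather quoted above; the function names are hypothetical, and the computed score differs from the printed one in the fourth decimal only because the text works with coefficients rounded to five places.

```python
def coupling_probs(p):
    # Expected proportions of the classes AB, Ab, aB, ab under coupling.
    return [(3 - 2 * p + p * p) / 4, (2 * p - p * p) / 4,
            (2 * p - p * p) / 4, (1 - 2 * p + p * p) / 4]

def coupling_derivs(p):
    # Derivatives of the four proportions with respect to p.
    d = 2 * (1 - p) / 4
    return [-d, d, d, -d]

def scoring_step(p, freqs):
    probs = coupling_probs(p)
    derivs = coupling_derivs(p)
    score = sum(f * d / pi for f, d, pi in zip(freqs, derivs, probs))
    info_per_obs = sum(d * d / pi for d, pi in zip(derivs, probs))
    correction = score / (sum(freqs) * info_per_obs)
    return score, info_per_obs, correction

score, info, correction = scoring_step(0.21, [125, 18, 20, 34])
second_approximation = 0.21 + correction
```

The same function evaluated at p = 0.25 returns 3.7909 for the information per observation, the value quoted for coupling in the comparison with repulsion.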

The scores for trial values from 1 to 50% are given in Table XIV; of 
Fisher and Yates (1948). They can be directly used by retaining two 
decimal places at each stage of approximation, and finally when two 
places are stabilized a complete calculation with more places may be 
carried out. 

Example 3. Scoring for several parameters. 

The method of scoring developed in example 2 can be extended to
the case of the simultaneous estimation of several parameters. If $\theta_1$,
$\theta_2, \cdots, \theta_q$ are the parameters, the ith efficient score is defined by

$$S_i = \frac{\partial \log L}{\partial\theta_i} \qquad i = 1, 2, \cdots, q$$

where L is the likelihood of the parameters, and the information matrix
is defined by $(I_{ij})$ where

$$I_{ij} = E(S_i S_j)$$

If the values of the efficient scores and information at the trial values
$\theta_1^0, \cdots, \theta_q^0$ are indicated with index zero, then small additive corrections
to the trial values are given by the simultaneous equations

$$I_{11}^0\,\delta\theta_1 + I_{12}^0\,\delta\theta_2 + \cdots + I_{1q}^0\,\delta\theta_q = S_1^0$$

$$\cdots$$

$$I_{q1}^0\,\delta\theta_1 + I_{q2}^0\,\delta\theta_2 + \cdots + I_{qq}^0\,\delta\theta_q = S_q^0$$

This operation is repeated with corrected values each time until stable
values of $\theta_1, \cdots, \theta_q$ are obtained. The variance of the final estimate
$\hat\theta_i$ of $\theta_i$ is given by $I^{ii}$, the co-factor of $I_{ii}$ in the determinant $|I_{ij}|$
divided by the determinant.
In the case of grouped distributions with $\pi_i$ and $f_i$ as probability and
frequency in the ith class,

$$S_j = \sum_i \frac{f_i}{\pi_i} \frac{\partial\pi_i}{\partial\theta_j}$$

and

$$I_{jk} = (\Sigma f_i) \sum_i \frac{1}{\pi_i} \frac{\partial\pi_i}{\partial\theta_j} \frac{\partial\pi_i}{\partial\theta_k}$$

so that the calculations become simple as illustrated below.
Blood Groups, ABO System. Every human being can be classified
into one of the four blood groups O, A, B, AB. The inheritance of these
blood groups is controlled by three allelomorphic genes, O, A, B, of
which O is recessive to A and B. If r, p, and q are gene frequencies of
O, A, and B, then the expected probabilities of the six genotypes (four
phenotypes) in random mating will be

Phenotype     Genotype     Probability of genotype     Probability of phenotype
O             OO           r²                          r²
A             AA           p²                          p² + 2pr
              AO           2pr
B             BB           q²                          q² + 2qr
              BO           2qr
AB            AB           2pq                         2pq




If O, A, B and AB are the observed frequencies adding to N, the
problem is to estimate the gene frequencies p, q, and r. A rough estimate
is supplied by

$$p' = 1 - \sqrt{\frac{O + B}{N}} \qquad q' = 1 - \sqrt{\frac{O + A}{N}} \qquad r' = \sqrt{\frac{O}{N}}$$

These may not necessarily add to unity, whereas the true values should.
Let D denote the deviation

$$D = 1 - (p' + q' + r')$$

Better estimates due to Bernstein are obtained as follows.

$$r = (1 + \tfrac{1}{2}D)(r' + \tfrac{1}{2}D)$$

$$p = (1 + \tfrac{1}{2}D)\,p'$$

$$q = (1 + \tfrac{1}{2}D)\,q'$$

where p', q', r', and D are as defined above.

There is still some deviation, $(1 - p - q - r) = \tfrac{1}{4}D^2$. If this is
small, then Bernstein's method supplies fairly good estimates. We
shall now show how these estimates can be improved by the method of
maximum likelihood, using the frequencies O = 176, A = 182, B = 60,
and AB = 17. Approximate solutions obtained by Bernstein's method
are

p = 0.26449     q = 0.09317     r = 0.64234
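Bernstein's adjustment can be written out directly. The following sketch assumes the deviation D is taken as one minus the sum of the rough estimates, which is the convention under which the residual deviation equals D²/4; the function name is hypothetical. Applied to the frequencies quoted above it agrees with the printed approximate solutions to about four decimal places.

```python
import math

def bernstein_abo(o, a, b, ab):
    n = o + a + b + ab
    # Rough estimates from the phenotype frequencies.
    p0 = 1.0 - math.sqrt((o + b) / n)
    q0 = 1.0 - math.sqrt((o + a) / n)
    r0 = math.sqrt(o / n)
    # Deviation of the rough estimates from unity.
    d = 1.0 - (p0 + q0 + r0)
    # Bernstein's improved estimates.
    p = (1.0 + d / 2.0) * p0
    q = (1.0 + d / 2.0) * q0
    r = (1.0 + d / 2.0) * (r0 + d / 2.0)
    return p, q, r, d

p, q, r, d = bernstein_abo(176, 182, 60, 17)
```

The remaining deviation 1 - p - q - r returned by this calculation is of the order of 0.00001, which is why the adjusted values serve well as trial values for the method of scoring.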


The probabilities and derivatives, with respect to the independent
parameters p and q in the general case, are

                                  Derivatives
Class     Probability π     ∂π/∂p     ∂π/∂q
O         r²                -2r       -2r
A         p(p + 2r)          2r       -2p
B         q(q + 2r)         -2q        2r
AB        2pq                2q        2p

The probabilities and coefficients for the calculation of efficient scores
at the approximate values obtained above are set out below.

                             Coefficients for Scores
Class     Probability π     (1/π)(∂π/∂p)     (1/π)(∂π/∂q)     Observed frequency
O         0.41260           -3.11362         -3.11362         176
A         0.40974            3.13543         -1.29104         182
B         0.12838           -1.45217         10.00685          60
AB        0.04928            3.78086         10.73307          17


The scores are

$$S_p = (-3.11362)176 + (3.13543)182 + (-1.45217)60 + (3.78086)17 = -0.20444$$

$$S_q = (-3.11362)176 + (-1.29104)182 + (10.00685)60 + (10.73307)17 = -0.09321$$

The information matrix * for a single observation is

$$I_{pp} = 9.00315 \qquad I_{pq} = 2.47676$$

$$I_{pq} = 2.47676 \qquad I_{qq} = 23.21612$$

* $I_{pp} = \Sigma \dfrac{1}{\pi} \left( \dfrac{\partial\pi}{\partial p} \right)^2 = \Sigma \dfrac{\partial\pi}{\partial p} \left( \dfrac{1}{\pi} \dfrac{\partial\pi}{\partial p} \right)$
= -2r(-3.11362) + 2r(3.13543) - 2q(-1.45217) + 2q(3.78086)
= 9.00315, etc.

Small corrections $\delta p$ and $\delta q$ to p and q are given by

$$N(9.00315\,\delta p + 2.47676\,\delta q) = -0.20444$$

$$N(2.47676\,\delta p + 23.21612\,\delta q) = -0.09321$$

The inverse of the information matrix per single observation is

$$I^{pp} = 0.114430 \qquad I^{pq} = -0.012208$$

$$I^{pq} = -0.012208 \qquad I^{qq} = 0.044376$$

The solutions are

$$\delta p = \frac{I^{pp} S_p + I^{pq} S_q}{N} = -0.00005116$$

$$\delta q = \frac{I^{pq} S_p + I^{qq} S_q}{N} = -0.00000377$$

The corrections are hardly necessary in this particular case. If the
corrections are not small, the whole process has to be repeated with the
second approximations. It is important to note that after some stage
the information matrix need not be recalculated for each approximation.
Only the new scores have to be calculated at each stage and used in
conjunction with the same inverse matrix of information (kept constant
from some stage) to obtain closer approximations. When convergence
is achieved, the information matrix and scores may be calculated for
the last approximate values and the final approximation obtained.
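The maximum likelihood estimates and the variances are

$$\hat{p} = 0.26444 \qquad V(\hat{p}) = \frac{I^{pp}}{N} = 0.00026305$$

$$\hat{q} = 0.09317 \qquad V(\hat{q}) = \frac{I^{qq}}{N} = 0.00010202$$

$$\hat{r} = 0.64239 \qquad V(\hat{r}) = \frac{I^{pp} + 2I^{pq} + I^{qq}}{N} = 0.00030893$$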
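The whole scoring calculation for the ABO frequencies can be mechanized as in the sketch below. It recomputes the information matrix at every round (the text notes that this may be dispensed with after some stage), inverts the two-by-two matrix by determinants, and stops when both corrections are negligible; the function names, tolerance and iteration limit are assumptions made for the illustration.

```python
def abo_probs_and_derivs(p, q):
    r = 1.0 - p - q
    probs = [r * r, p * (p + 2 * r), q * (q + 2 * r), 2 * p * q]  # O, A, B, AB
    dp = [-2 * r, 2 * r, -2 * q, 2 * q]    # derivatives with respect to p
    dq = [-2 * r, -2 * p, 2 * r, 2 * p]    # derivatives with respect to q
    return probs, dp, dq

def abo_information(p, q, n):
    probs, dp, dq = abo_probs_and_derivs(p, q)
    ipp = n * sum(a * a / pi for a, pi in zip(dp, probs))
    ipq = n * sum(a * b / pi for a, b, pi in zip(dp, dq, probs))
    iqq = n * sum(b * b / pi for b, pi in zip(dq, probs))
    return ipp, ipq, iqq

def abo_scoring(freqs, p, q, tol=1e-10, max_iter=50):
    n = sum(freqs)
    for _ in range(max_iter):
        probs, dp, dq = abo_probs_and_derivs(p, q)
        sp = sum(f * a / pi for f, a, pi in zip(freqs, dp, probs))
        sq = sum(f * b / pi for f, b, pi in zip(freqs, dq, probs))
        ipp, ipq, iqq = abo_information(p, q, n)
        det = ipp * iqq - ipq * ipq
        step_p = (iqq * sp - ipq * sq) / det
        step_q = (ipp * sq - ipq * sp) / det
        p, q = p + step_p, q + step_q
        if max(abs(step_p), abs(step_q)) < tol:
            break
    ipp, ipq, iqq = abo_information(p, q, n)
    det = ipp * iqq - ipq * ipq
    return p, q, iqq / det, ipp / det   # estimates and their variances

p_hat, q_hat, var_p, var_q = abo_scoring([176, 182, 60, 17], 0.26449, 0.09317)
```

Started from Bernstein's values the routine settles within a few rounds on estimates and variances that agree with those printed above to the accuracy shown in the tests.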


4c.3 Combination of Data 

The advantage of the scoring system can be best seen in the mechanization
it introduces when various sets of data giving information on
some parameters have to be combined for estimation. If L is the joint
likelihood based on all the data and $L_i$ for the ith part, then

$$L = L_1 L_2 \cdots$$

$$\frac{\partial \log L}{\partial\theta_i} = \frac{\partial \log L_1}{\partial\theta_i} + \frac{\partial \log L_2}{\partial\theta_i} + \cdots$$

which shows that the efficient scores are additive. Also, if $I_{rs}$ is an
element of the information matrix for the whole body of data and $I_{rs}^{(j)}$
for the jth part, then

$$I_{rs} = I_{rs}^{(1)} + I_{rs}^{(2)} + \cdots$$

Thus, to obtain the best estimates it is necessary to replace each part
of the data by the scores and information matrix at a trial value and
obtain the total scores and information matrix by simple addition. The
correction to trial values can be obtained by solving simultaneous equations
as shown in the previous section.


Appendix: Some Limiting Theorems 


A general convergence theorem (Cramér, 1946). Let $\xi_1, \xi_2, \cdots$ be a
sequence of random variables with distribution functions $F_1, F_2, \cdots$.
Suppose that $F_n(x) \to F(x)$ as $n \to \infty$. Let $\eta_1, \eta_2, \cdots$ be another sequence
of random variables, and suppose that $\eta_n$ converges in probability
to a constant c. If

$$X_n = \xi_n + \eta_n \qquad Y_n = \xi_n \eta_n \qquad Z_n = \frac{\xi_n}{\eta_n}$$

then the distribution functions

of $X_n \to F(x - c)$

of $Y_n \to F(x/c)$ if c > 0

of $Z_n \to F(cx)$ if c > 0

The theorem covers the case of c < 0 also, in which case the variables
$-\eta_1, -\eta_2, \cdots$ would be considered.

The theorem is proved for $Z_n$, the proof being similar for the rest.
The set S of points satisfying $\xi_n/\eta_n \le x$ consists of two non-overlapping
sets:

$$S_1: \quad \frac{\xi_n}{\eta_n} \le x \qquad |\eta_n - c| \le \epsilon$$

$$S_2: \quad \frac{\xi_n}{\eta_n} \le x \qquad |\eta_n - c| > \epsilon$$

Thus

$$P_n(S) = P_n(S_1) + P_n(S_2)$$

For every $\epsilon$

$$P_n(S_2) \le P_n(|\eta_n - c| > \epsilon) \to 0$$

by hypothesis, in which case $P_n(S)$ lies within the limits

$$P_n\{ \xi_n \le (c \pm \epsilon)x, \ |\eta_n - c| \le \epsilon \}$$

This differs from the corresponding quantity

$$P_n\{ \xi_n \le (c \pm \epsilon)x \} = F_n\{ (c \pm \epsilon)x \}$$

by less than $P_n(|\eta_n - c| > \epsilon)$.
As $n \to \infty$, $P_n(|\eta_n - c| > \epsilon) \to 0$.
$P_n(S)$ is thus enclosed between two limits which can be made to lie as
close as possible to F(cx) by choosing $\epsilon$ small. The theorem is thus
proved. It may be noted that no condition of independence of the
variables involved is assumed.

Slutsky's theorem (1925). If $\xi_n, \eta_n, \cdots, \rho_n$ are random variables converging
in probability to constants $x, y, \cdots, r$, respectively, any rational
function $R(\xi_n, \eta_n, \cdots, \rho_n)$ converges in probability to the constant


$R(x, y, \cdots, r)$, provided that the latter is finite. It follows that any
power $R^k(\xi_n, \eta_n, \cdots, \rho_n)$ with k > 0 converges in probability to
$R^k(x, y, \cdots, r)$.

Kintchine's theorem. Let $\xi_1, \xi_2, \cdots$ be independent random variables,
all having the same distribution function F(x), and suppose that F(x)
has a finite mean m. Then the variable $\bar\xi = \dfrac{1}{n} \sum_{1}^{n} \xi_\nu$ converges in
probability to m.

Levy's theorem (1925). A necessary and sufficient condition for the
convergence of the sequence $\{F_n(x)\}$ of distribution functions to a distribution
function F(x) is that, for every t, the sequence $\{\phi_n(t)\}$ of
characteristic functions converges to the limit $\phi(t)$ which is continuous
for the special value t = 0 and is the characteristic function for F(x).

Central limit theorems. (1) Lindeberg and Levy (1922, 1925). If $\xi_1, \xi_2, \cdots$
are independent random variables, all having the same probability
distribution, and if m and $\sigma$ denote the mean and standard deviation of
every $\xi_\nu$, then the sum $\xi = \sum_{1}^{n} \xi_\nu$ is asymptotically normally distributed
with mean nm and standard deviation $\sigma\sqrt{n}$.

(2) Liapounoff (1901). Let $\xi_1, \xi_2, \cdots$ be independent random variables
with means and standard deviations $m_\nu$ and $\sigma_\nu$, $(\nu = 1, 2, \cdots)$.
Suppose that the third absolute moment of $\xi_\nu$ about its mean

$$\rho_\nu^3 = E(|\xi_\nu - m_\nu|^3)$$

is finite for every $\nu$. If $\rho/\sigma \to 0$ as $n \to \infty$, where

$$\rho^3 = \rho_1^3 + \rho_2^3 + \cdots + \rho_n^3$$

then the sum $\sum_{1}^{n} \xi_\nu$ is asymptotically normal with mean $m = m_1 + m_2 + \cdots + m_n$
and variance $\sigma^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2$.


References 


Bernstein, F. (1925). Zusammenfassende Betrachtungen über die erblichen
Blutstrukturen des Menschen. Z. indukt. Abstamm. u. Vererb. Lehre, 37, 237.

Bhattacharya, A. (1946). On some analogues of the amount of information and
their uses in statistical estimation. (In three parts.) Sankhyā, 8, 1, 201, 315.

Cramér, H. (1946). Mathematical methods of statistics. Princeton University
Press.

Dugué, D. (1937). Application des propriétés de la limite au sens du calcul des
probabilités à l'étude des diverses questions d'estimation. J. Ecol. Poly., 3, no. 4,
305.




Fisher, R. A. (1921). On mathematical foundations of theoretical statistics. Philos.
Trans. Roy. Soc. A, 222, 309.

Fisher, R. A. (1938). Statistical theory of estimation. Calcutta University
Readership lectures.

Fisher, R. A., and F. Yates (1948). Statistical tables for biological, agricultural
and medical research. Oliver & Boyd. Third edition.

Huzurbazar, V. S. (1948). The likelihood equation, consistency, and maxima of
the likelihood function. Ann. Eugen. London, 14, 185.

Koopman, B. O. (1936). On distributions admitting a sufficient statistic. Trans.
Am. Math. Soc., 39, 399.

Levy, P. (1925). Calcul des probabilités. Gauthier Villars, Paris.

Levy, P. (1937). Théorie de l'addition des variables aléatoires. Paris.

Liapounoff, A. (1901). Nouvelle forme du théorème sur la limite de probabilité.
Mém. acad. sci. St. Petersbourg, 12, 5.

Lindeberg, J. W. (1922). Eine neue Herleitung des Exponentialgesetzes in der
Wahrscheinlichkeitsrechnung. Math. Zeitschr., 15, 211.

Mather, K. (1938). Measurement of linkage in heredity. Methuen & Co. London.

Merrill, A. S. (1928). Frequency distribution of an index when both the
components follow the normal law. Biom., 20A, 53.

Munter, A. H. (1936). A study of the lengths of long bones of the arms and legs
in man, with special reference to Anglo-Saxon skeletons. Biom., 28, 258.

Neyman, J. (1937). Outline of a theory of statistical estimation based on the
classical theory of probability. Philos. Trans. Roy. Soc. A, 236, 333.

Rao, C. R. (1945). Information and accuracy attainable in the estimation of
statistical parameters. Bull. Calcutta Math. Soc., 37, 81.

Rao, C. R. (1947). Minimum variance and the estimation of several parameters.
Proc. Camb. Phil. Soc., 43, 280.

Rao, C. R. (1948). Sufficient statistics and minimum variance estimates. Proc.
Camb. Phil. Soc., 45, 213.

Slutsky, E. (1925). Über stochastische Asymptoten und Grenzwerte. Metron, 5,
3, 3.

Wald, A. (1949). A note on the consistency of the maximum likelihood estimate.
Ann. Math. Stats., 20, 595.


CHAPTER 5 


Large Sample Tests of Hypotheses
with Applications to Problems of Estimation


5a The General Theory of Tests in Large Samples 


5a.1 The Nature of Statistical Hypotheses 

If the probability differential of a set of stochastic variables contains 
k unknown parameters, the statistical hypotheses concerning them may 
be simple or composite. The hypothesis leading to a complete speci- 
fication of the values of the k parameters is a simple hypothesis, and 
the one leading to a collection of admissible sets a composite hypothesis. 
In this chapter are discussed tests of these two types of hypotheses on 
the basis of a large number of observations from any probability distri- 
bution satisfying some mild restrictions and also their use in problems 
of estimation. 


5a.2 The Problem of Distribution 
There are two problems of distribution that are useful in deriving
tests of significance for simple and composite hypotheses. Let
\[ x_1, \ldots, x_p;\quad y_1, \ldots, y_q;\quad \ldots \]
be independent sets of observations from probability laws with densities
represented by $f_1(x \mid \theta)$, $f_2(y \mid \theta)$, $\ldots$ such that each function contains
at least one of the unknown parameters $\theta_1, \ldots, \theta_k$. The likelihood of
the parameters which is the same as the probability density of the
observed sets of data is
\[ L = f_1(x \mid \theta)\, f_2(y \mid \theta) \cdots \]
As defined in Chapter 4 the $i$th efficient score is represented by
\[ \phi_i = \frac{\partial \log L}{\partial \theta_i} \]


The mean values of these scores are zero. Their covariance matrix is
represented by $(\alpha_{ij})$, and its reciprocal by $(\alpha^{ij})$. Let there exist a
positive quantity $\eta$ such that
\[ E\left\{ \left| \frac{1}{f_i} \frac{\partial f_i}{\partial \theta_j} \right|^{2+\eta} \right\} \tag{5a.2.1} \]
are finite. Under these conditions, if the non-vanishing terms in the
sequence $\partial \log f_i / \partial \theta_j$ $(i = 1, 2, \ldots)$, for any $j$ form a sufficiently large
set, it follows from an extension of the central limit theorem to many
variables that the limiting distribution of $\phi_1, \ldots, \phi_k$ at the true values
$\theta_1, \ldots, \theta_k$ tends to the multivariate normal form with zero mean and
covariance matrix $(\alpha_{ij})$,
\[ \text{const.}\ e^{-Q/2}\, d\phi_1 \cdots d\phi_k \]
where
\[ Q = \sum\sum \alpha^{ij} \phi_i \phi_j \]
Hence $Q$ is distributed, in large samples, as $\chi^2$ with $k$ degrees of freedom
when the true values of the parameters are $\theta_1, \ldots, \theta_k$.
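If the probability densities $f_1, f_2, \ldots$ are the same, it is enough for
the limiting properties to hold that
\[ E\left\{ \left( \frac{1}{f} \frac{\partial f}{\partial \theta_j} \right)^{2} \right\} \]
is finite for every $j$ which is less restrictive than the condition (5a.2.1).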
Suppose that the $\theta$ parameters are subject to $s$ restrictions defined
by $s$ independent relations
\[ \psi_t(\theta_1, \ldots, \theta_k) = 0 \qquad t = 1, 2, \ldots, s \tag{5a.2.2} \]
The maximum likelihood estimates are given by
\[ \phi_i + \sum_t \lambda_t \frac{\partial \psi_t}{\partial \theta_i} = 0 \quad (i = 1, 2, \ldots, k), \qquad \psi_t = 0 \quad (t = 1, 2, \ldots, s) \tag{5a.2.3} \]
where $\lambda_t$ are Lagrangian multipliers. Let $\hat\theta_1, \ldots, \hat\theta_k$ be the maximum
likelihood estimates. Since the set of equations (5a.2.3) involve $(k - s)$
linear restrictions on $\phi_i(\hat\theta)$, it is expected that the statistic
\[ \chi^2 = \sum\sum \alpha^{ij}(\hat\theta)\, \phi_i(\hat\theta)\, \phi_j(\hat\theta) \]
is distributed as $\chi^2$ with $s$ degrees of freedom which is $(k - s)$ less than
the degrees of freedom for true values.

This can be demonstrated if we assume that the restrictions (5a.2.2)
specify $s$ of the parameters $\theta_{k-s+1}, \ldots, \theta_k$ (say) as functions of $(k - s)$
free parameters $\theta_1, \ldots, \theta_{k-s}$, so that the likelihood is an explicit function
of these parameters only, and further that the joint distribution of
$\hat\theta_1, \ldots, \hat\theta_{k-s}$ tends to the multivariate normal form in large samples
with variances and covariances of $O(n^{-1})$. It is known that the latter
assumption is true provided that the probability laws satisfy the
condition (5a.2.1), and further that the maximum likelihood estimates are
uniformly consistent (Wald, 1949). This does not seem to be a necessary
condition, and the approach to normality is probably true under less
stringent conditions.

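Let us take the case of two parameters and one restrictive condition
which may be taken as $\theta_2 = w(\theta_1)$. The differential coefficient $d\theta_2/d\theta_1$
is denoted by $\lambda(\theta_1)$. The maximum likelihood estimates satisfy the
equations
\[ \phi_1(\hat\theta) + \lambda(\hat\theta_1)\, \phi_2(\hat\theta) = 0 \qquad \hat\theta_2 = w(\hat\theta_1) \tag{5a.2.4} \]
If the given relation is true, then the statistic
\[ \chi_0^2 = \sum\sum \alpha^{ij}(\theta)\, \phi_i(\theta)\, \phi_j(\theta) \tag{5a.2.5} \]
depends only on $\theta_1$ and is distributed as $\chi^2$ with 2 degrees of freedom at
the true value of $\theta_1$. The expression (5a.2.5) treated as a function of
$\theta_1$ may be expanded in the neighborhood of $\hat\theta_1$. The first term is
\[ \chi_1^2 = \sum\sum \alpha^{ij}(\hat\theta)\, \phi_i(\hat\theta)\, \phi_j(\hat\theta) \]
The second term is
\[ -2(\theta_1 - \hat\theta_1)\bigl[\phi_1(\hat\theta)\{\alpha^{11}(\alpha_{11} + \lambda\alpha_{12}) + \alpha^{12}(\alpha_{12} + \lambda\alpha_{22})\}
 + \phi_2(\hat\theta)\{\alpha^{21}(\alpha_{11} + \lambda\alpha_{12}) + \alpha^{22}(\alpha_{12} + \lambda\alpha_{22})\}\bigr] \]
\[ = -2(\theta_1 - \hat\theta_1)[\phi_1(\hat\theta) + \lambda\phi_2(\hat\theta)] = 0 \tag{5a.2.6} \]
by virtue of (5a.2.4). In the expression (5a.2.6) terms $O(n^0)$ only have
been retained, $\partial\phi_i/\partial\theta_j$ being replaced by $-\alpha_{ij}$, and terms of the type
\[ (\theta_1 - \hat\theta_1)\, \frac{\partial \alpha^{ij}}{\partial \theta_1}\, \phi_i \phi_j \]
being omitted as they are of $O(n^{-1/2})$.
The third term can be easily shown to be
\[ \chi_2^2 = (\theta_1 - \hat\theta_1)^2[\alpha_{11}(\hat\theta) + 2\lambda\alpha_{12}(\hat\theta) + \lambda^2\alpha_{22}(\hat\theta)] \]
Neglecting terms of higher order of smallness, we obtain
\[ \chi_0^2 = \chi_1^2 + \chi_2^2 \]
Since
\[ V(\hat\theta_1 - \theta_1) = \frac{1}{\alpha_{11}(\theta) + 2\lambda\alpha_{12}(\theta) + \lambda^2\alpha_{22}(\theta)} \]
it follows that $\chi_2^2$ is distributed in large samples as $\chi^2$ with 1 degree of
freedom.

It can be demonstrated by expanding $\phi_i(\hat\theta)$ in powers of $(\hat\theta_1 - \theta_1)$
that $(\hat\theta_1 - \theta_1)$ and $\phi_i(\hat\theta)$ tend to be uncorrelated in large samples, so
that $\chi_1^2$ and $\chi_2^2$ are independently distributed in the limiting case.

Since $\chi_0^2$ is distributed as $\chi^2$ with 2 degrees of freedom and $\chi_2^2$ with
1 degree of freedom, it follows that the residual part $\chi_1^2$ is distributed
as $\chi^2$ with 1 degree of freedom.

For $s$ relations and $k\,(> s)$ parameters, $\chi_0^2$ can be expressed as a
function of $(k - s)$ parameters and split into two portions, one of which
is $\chi_2^2$ with $(k - s)$ degrees of freedom measuring the discrepancy of
the $(k - s)$ estimated parameters from their true values, and another
$\chi_1^2$ with $s$ degrees of freedom measuring the departures from the assigned
relationships.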
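
The partition of $\chi_0^2$ can be checked numerically in the simplest case. The sketch below assumes two normal samples of $n$ observations each with unit variance and the restriction $\theta_2 = \theta_1$; this model, the function name, and the numerical inputs are illustrative choices, not taken from the text. Under these assumptions $\phi_i = n(\bar x_i - \theta_i)$, $\alpha_{11} = \alpha_{22} = n$, $\alpha_{12} = 0$, $\lambda = 1$, and the expansion is exact.

```python
def partition_chi2(xbar1, xbar2, n, theta):
    """Return (chi0^2, chi1^2, chi2^2) for two normal means under theta2 = theta1.

    Illustrative assumptions: unit variance, n observations per sample, so
    phi_i = n*(xbar_i - theta_i), alpha_11 = alpha_22 = n, alpha_12 = 0, lambda = 1.
    """
    # chi0^2: quadratic form in the scores at the true common value theta
    chi0 = ((n * (xbar1 - theta)) ** 2 + (n * (xbar2 - theta)) ** 2) / n
    # Restricted maximum likelihood estimate of the common mean
    theta_hat = (xbar1 + xbar2) / 2
    # chi1^2: the same quadratic form evaluated at the estimate
    chi1 = ((n * (xbar1 - theta_hat)) ** 2 + (n * (xbar2 - theta_hat)) ** 2) / n
    # chi2^2 = (theta - theta_hat)^2 * (alpha_11 + 2*lambda*alpha_12 + lambda^2*alpha_22)
    chi2 = (theta - theta_hat) ** 2 * (n + 2 * 1 * 0 + 1 * n)
    return chi0, chi1, chi2


chi0, chi1, chi2 = partition_chi2(xbar1=0.3, xbar2=-0.1, n=50, theta=0.0)
print(chi0, chi1, chi2)
```

Here $\chi_1^2$ measures the departure from the assigned relationship and $\chi_2^2$ the discrepancy of the estimated free parameter from its true value; their sum reproduces $\chi_0^2$.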


5b Applications of the General Theory 


5b.1 The χ² Test of Departure from a Simple Hypothesis
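The problem is to test whether $n_1, \ldots, n_k$ $(\sum n_i = n)$, the frequencies
in $k$ classes, are in accordance with some hypothetical proportions $\pi_1,
\ldots, \pi_k$ $(\sum \pi_i = 1)$. The probability of the sample on the given
hypothesis is
\[ L = \frac{n!}{n_1! \cdots n_k!}\, \pi_1^{n_1} \cdots \pi_k^{n_k} \]
There are only $(k - 1)$ independent parameters which may be taken
as $\pi_1, \ldots, \pi_{k-1}$. The efficient scores are
\[ \phi_i = \frac{\partial \log L}{\partial \pi_i} = \frac{n_i}{\pi_i} - \frac{n_k}{\pi_k} \]
Their variances and covariances are
\[ \alpha_{ii} = n\left(\frac{1}{\pi_i} + \frac{1}{\pi_k}\right) \quad \text{using the formula (2a.9.1)} \]
\[ \alpha_{ij} = \frac{n}{\pi_k} \quad \text{using the formula (2a.9.1)} \]
\[ |\alpha_{ij}| = \frac{n^{k-1}}{\pi_1 \cdots \pi_k} \]
\[ \alpha^{ii} = \frac{\pi_i(1 - \pi_i)}{n} \qquad \alpha^{ij} = -\frac{\pi_i \pi_j}{n} \]
The $\chi^2$ statistic is
\[ \sum\sum \alpha^{ij}\phi_i\phi_j
 = \sum \frac{\pi_i(1 - \pi_i)}{n}\left(\frac{n_i}{\pi_i} - \frac{n_k}{\pi_k}\right)^2
 - \sum_{i \ne j}\sum \frac{\pi_i\pi_j}{n}\left(\frac{n_i}{\pi_i} - \frac{n_k}{\pi_k}\right)\left(\frac{n_j}{\pi_j} - \frac{n_k}{\pi_k}\right) \]
\[ = \sum \frac{\pi_i(1 - \pi_i)}{n}\left(\frac{d_i}{\pi_i} - \frac{d_k}{\pi_k}\right)^2
 - \sum_{i \ne j}\sum \frac{\pi_i\pi_j}{n}\left(\frac{d_i}{\pi_i} - \frac{d_k}{\pi_k}\right)\left(\frac{d_j}{\pi_j} - \frac{d_k}{\pi_k}\right) \]
where $d_i = n_i - n\pi_i$ $(i = 1, \ldots, k)$. The above expression reduces to
\[ \sum_{i=1}^{k} \frac{d_i^2}{n\pi_i} = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum \frac{(O - E)^2}{E} \]

As shown in 5a.2 the large sample distribution of the above statistic
is that of $\chi^2$ with $(k - 1)$ degrees of freedom because the test is based
on $(k - 1)$ independent parameters. We shall, however, present an
alternative way of deriving its distribution.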
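
The reduction of the score form to $\sum (O - E)^2/E$ can be checked numerically. The following is a minimal sketch; the function names and the three-class frequencies are illustrative assumptions, and the reciprocal matrix is taken to be $\alpha^{ii} = \pi_i(1 - \pi_i)/n$, $\alpha^{ij} = -\pi_i\pi_j/n$.

```python
def pearson_chi2(counts, probs):
    """Sum of (O - E)^2 / E over all k classes."""
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, probs))


def score_chi2(counts, probs):
    """Quadratic form sum_ij alpha^{ij} phi_i phi_j over the first k-1 classes."""
    n = sum(counts)
    k = len(counts)
    # Efficient scores phi_i = n_i/pi_i - n_k/pi_k for i = 1, ..., k-1
    phi = [counts[i] / probs[i] - counts[-1] / probs[-1] for i in range(k - 1)]
    q = 0.0
    for i in range(k - 1):
        for j in range(k - 1):
            if i == j:
                a = probs[i] * (1 - probs[i]) / n   # alpha^{ii}
            else:
                a = -probs[i] * probs[j] / n        # alpha^{ij}, i != j
            q += a * phi[i] * phi[j]
    return q


counts = [30, 50, 20]        # illustrative frequencies, not from the text
probs = [0.25, 0.50, 0.25]   # hypothetical proportions
print(pearson_chi2(counts, probs), score_chi2(counts, probs))
```

Both functions return the same value for any set of frequencies, which is the identity derived above.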

In 2a.3 it was shown that the multinomial distribution is equivalent
to a product of Poisson distributions subject to the condition that the
sum of the variates is $n$. If each of the individual cell frequencies is
large, then the Poisson probabilities could be replaced by the normal
approximation. If $x_i = (n_i - n\pi_i)/\sqrt{n\pi_i}$, then the approximate
distribution of $x_1, \ldots, x_k$ is
\[ \text{const.}\ e^{-\frac{1}{2}(x_1^2 + \cdots + x_k^2)}\, dx_1 \cdots dx_k \]
subject to the condition
\[ \sqrt{n\pi_1}\,x_1 + \sqrt{n\pi_2}\,x_2 + \cdots + \sqrt{n\pi_k}\,x_k
 = (n_1 - n\pi_1) + \cdots + (n_k - n\pi_k) = \sum n_i - n = 0 \]
Therefore by using the result in 2c.3 the distribution of $\chi^2$ is that of the
sum of squares of $k$ variates $N(0, 1)$ subject to one homogeneous linear
restraint, i.e., $\chi^2$ with $(k - 1)$ degrees of freedom.

If the deviations of the observed from the expected frequencies are
subject to $t$ linear homogeneous restrictions on the total, then $\sum x_i^2$ is a
$\chi^2$ with $(k - t)$ degrees of freedom.

Example 1. Bateson gives the following data concerning the
segregation of two genes for purple-red flower color and long-round pollen
shape in sweet pea.

The results are from an intercross so that the expected frequencies
on the hypothesis of independent segregation are in the ratio 9:3:3:1.
Are the data in agreement with the expected frequencies?

                        Purple-Long   Red-Long   Purple-Round   Red-Round     Total
Observed                    296           27           19           85          427
Expected                  3843/16      1281/16      1281/16       427/16        427
(Observed)²/Expected     364.7817      9.1054       4.5090      270.7260    649.1221

\[ \chi^2 = \sum \frac{O^2}{E} - \sum O = 649.1221 - 427 = 222.1221 \]

This is significant on 3 degrees of freedom, showing that there is a
departure from the expected. For the large sample test to hold, it is
necessary that each cell frequency should be at least greater than 5.
If any such frequency is small, two suitable cells can be combined to
form one cell with a higher frequency.
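
The arithmetic of Example 1 can be reproduced with a short script. This is a minimal sketch using the shortcut $\chi^2 = \sum O^2/E - \sum O$; the variable names are illustrative, and the value 7.815 is the tabulated 5% point of $\chi^2$ with 3 degrees of freedom.

```python
from fractions import Fraction

observed = [296, 27, 19, 85]   # Purple-Long, Red-Long, Purple-Round, Red-Round
ratio = [9, 3, 3, 1]           # expected ratio under independent segregation

n = sum(observed)
# Expected frequencies kept as exact fractions, e.g. 427 * 9 / 16
expected = [Fraction(n * r, sum(ratio)) for r in ratio]

# Shortcut form: sum(O^2 / E) - n
chi2 = float(sum(Fraction(o * o) / e for o, e in zip(observed, expected)) - n)
df = len(observed) - 1

print(f"chi-square = {chi2:.4f} on {df} degrees of freedom")
print("exceeds the 5% point 7.815:", chi2 > 7.815)
```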

Example 2. The number of deaths due to cholera is 350 out of a
total of 976 due to all causes in a certain week. Is cholera on the
increase if it accounted for 1/3 of the deaths in the last week?

The expected number of deaths due to cholera on the basis of 1/3
proportion is 976/3 = 325.3. The value of $\chi^2$ with 1 degree of freedom is
\[ \chi^2 = \frac{(350 - 325.3)^2}{325.3} + \frac{(626 - 650.6)^2}{650.6}
 = \frac{(2 \times 350 - 626)^2}{(1)(2)(976)} = 2.80 \]
the probability of exceeding which is just less than 10%. This
probability is not appropriate in answering the problem whether cholera is
on the increase. $\chi^2$ gives the probability for deviations both in excess
or defect of the expected, whereas only the probability of deviations
in excess of the expected is relevant to this problem. To determine
this we observe that
\[ \sqrt{\chi^2} = 1.6733 \]
can be used as a normal deviate with zero mean and unit standard
deviation. The probability of a normal deviate's exceeding this value
is less than 5%, which shows that cholera is on the increase.

Example 3. Test whether the frequencies 8, 3, 1 could have arisen
from a trinomial with equal probabilities.

The expected values 4, 4, 4 are all small so that the $\chi^2$ approximation
cannot be used. If we ignore this condition, the $\chi^2$ with 2 degrees of
freedom is
\[ \frac{8^2}{4} + \frac{3^2}{4} + \frac{1^2}{4} - 12 = 6.50 \]
which has a probability between 2 and 5%. In problems where the
expected frequencies are small in some cells, they can be combined with
like cells so as to have frequencies at least greater than 5 in each


class. Such a procedure is not possible in the present example because 
all the expectations are small. This necessitates the evaluation of the 
probability of the observed distribution 8, 3, 1 and those less probable 
than this on the hypothesis of equal probabilities of the three classes 
of events. The probability for any observed distribution $x, y, z$ is
\[ \frac{12!}{x!\, y!\, z!} \left(\frac{1}{3}\right)^{12} \]

There are 91 partitions of 12 for which the probabilities are calculated
below. The probability shown is that of a single arrangement of the given
type, and the number is the count of arrangements of that type.

Partitional Type     Number     Probability
   12   0   0           3       0.000001882
   11   1   0           6       0.00002258
   10   2   0           6       0.0001242
   10   1   1           3       0.0002484
    9   3   0           6       0.0004140
    9   2   1           6       0.001242
    8   4   0           6       0.0009314
    8   3   1           6       0.003726
    8   2   2           3       0.005589
    7   5   0           6       0.001490
    7   4   1           6       0.007452
    7   3   2           6       0.01490
    6   6   0           3       0.001737
    6   5   1           6       0.01043
    6   4   2           6       0.02608
    6   3   3           3       0.03477
    5   5   2           3       0.03130
    5   4   3           6       0.05216
    4   4   4           1       0.06520

The sum of probabilities less than or equal to 0.003726 corresponding
to (8, 3, 1) is 0.0537, which exceeds the probability obtained by the
$\chi^2$ approximation. The approximation overestimated significance.
Even according to the exact test, the probability being near 5%, the
hypothesis may be rejected.
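
The exact probability of Example 3 can be obtained by enumerating every ordered configuration of 12 observations in three classes. The sketch below uses exact rational arithmetic; the function names are illustrative, and the enumeration is written for three classes only.

```python
from fractions import Fraction
from math import factorial


def multinomial_prob(cells, probs):
    """Exact probability of one ordered configuration of cell counts."""
    coef = factorial(sum(cells))
    for c in cells:
        coef //= factorial(c)      # each partial quotient is an integer
    prob = Fraction(coef)
    for c, q in zip(cells, probs):
        prob *= q ** c
    return prob


def exact_trinomial_pvalue(observed, probs):
    """Total probability of all configurations no more probable than the observed."""
    n = sum(observed)
    p_obs = multinomial_prob(observed, probs)
    total = Fraction(0)
    for x in range(n + 1):
        for y in range(n - x + 1):
            p = multinomial_prob((x, y, n - x - y), probs)
            if p <= p_obs:
                total += p
    return total


third = Fraction(1, 3)
p_value = exact_trinomial_pvalue((8, 3, 1), (third, third, third))
print(float(p_value))
```

Exact fractions avoid ties being broken by rounding when a configuration has the same probability as the observed one.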

Example 4. Out of 8 fossils discovered, 2 and 6 were identified as
belonging to male and female. Is this compatible with the sex ratio
1:1?

The expected values are small, as in example 3, so that an exact
treatment is needed. The problem is the same as that of finding the
probability of 6 or more heads or tails in 8 tosses with an unbiased coin.
The total chance for 6, 7, and 8 heads is
\[ ({}^8C_6 + {}^8C_7 + {}^8C_8)\, 2^{-8} = \frac{37}{2^8} \]
The total for heads as well as tails is twice the above probability equal
to $37/2^7 = 0.289$, which is quite high, so that there is no definite
evidence against the 1:1 sex proportion. In a general case the term-by-term
evaluation may be a difficult job. It is shown in 2a.1 that the sum
of the probabilities for $0, 1, \ldots, r$ successes is given by the incomplete
B-integral,
\[ \frac{n!}{r!\,(n - r - 1)!} \int_0^q x^{n-r-1}(1 - x)^r\, dx \]
where $q\,(= 1 - p)$ is the chance of a failure. This function is tabulated
in the incomplete beta tables edited by K. Pearson. In the above
problem
\[ p = q = \tfrac{1}{2} \qquad n = 8 \qquad r = 2 \]
In the notation of incomplete beta tables *

index p = (n − r − 1) + 1 = 6
index q = r + 1 = 3
x = probability q = 0.5

The tabular entry for x = 0.5, index p = 6, and index q = 3 is 0.1445312,
which is the probability for 0, 1, and 2 heads. By symmetry the
probability for tails is also the same, so that the required total is 0.2890624,
which agrees with the value obtained above.
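
The agreement between the term-by-term sum and the incomplete B-integral in Example 4 can be verified exactly. In this minimal sketch the function names are illustrative; the integral is evaluated by expanding $(1 - x)^r$ with the binomial theorem, which is an implementation choice rather than the method of the tables.

```python
from fractions import Fraction
from math import comb, factorial


def binom_lower_tail(n, r, p):
    """Probability of 0, 1, ..., r successes in n trials with success chance p."""
    q = 1 - p
    return sum(comb(n, k) * p ** k * q ** (n - k) for k in range(r + 1))


def beta_integral_form(n, r, q):
    """n!/(r!(n-r-1)!) times the integral of x^(n-r-1) (1-x)^r from 0 to q."""
    coef = Fraction(factorial(n), factorial(r) * factorial(n - r - 1))
    # Term-by-term integration of x^(n-r-1) * sum_j C(r, j) (-x)^j
    integral = sum(
        Fraction((-1) ** j * comb(r, j), n - r + j) * q ** (n - r + j)
        for j in range(r + 1)
    )
    return coef * integral


half = Fraction(1, 2)
one_tail = binom_lower_tail(8, 2, half)      # 0, 1, or 2 heads in 8 tosses
via_beta = beta_integral_form(8, 2, half)    # index p = 6, index q = 3, x = 0.5
print(float(one_tail), float(via_beta), float(2 * one_tail))
```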


* There are three quantities to be entered into this table. Index p is equal to 1
plus the power of x in the above integral, and index q is 1 plus the power of (1 − x).
The entry x of the table is the upper limit of the integral. The probabilities p and q
are not to be confused with indices p, q of the table.


5b.2 The χ² Test of Goodness of Fit

The general problem in judging goodness of fit is to test whether the
cell proportions can be expressed as functions of a fewer number of
parameters. Thus, if O, A, B, and AB represent the four blood group
classes, it may be desired to test whether the cell frequencies can be
expressed in terms of gene frequencies of O, A, and B, or two independent
parameters p and q (example 3 in 4c.2). Here the values of p and q
are not known, but what is tested is the existence of consistency relations
among the probability expressions for the four classes.

To test whether an observed distribution has arisen from a normal
distribution, the probability $\pi_i$ in the $i$th class bounded by $a_i$ and
$a_{i+1}$ is
\[ \pi_i = \int_{a_i}^{a_{i+1}} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2/2\sigma^2}\, dx \]
so that the cell probabilities could be expressed in terms of two parameters
$\mu$ and $\sigma$.

There are in general $(k - 1)$ independent proportions $\pi_1, \ldots, \pi_{k-1}$
specifying the distribution in $k$ classes. If these proportions can be
expressed as functions of $s$ independent parameters, then all the $(k - 1)$
proportions can be expressed as functions of $s$ suitably chosen
proportions. Thus there are $(k - 1 - s)$ restrictions on $\pi_1, \ldots, \pi_{k-1}$. If
we construct
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
over all classes and substitute for $\pi_1, \ldots, \pi_{k-1}$ their best estimates
subject to the $(k - 1 - s)$ restrictions, then the $\chi^2$ has $(k - 1 - s)$
degrees of freedom. This can be used to test whether the specification
is correct or not.

To estimate $\pi_1, \ldots, \pi_{k-1}$ subject to $(k - 1 - s)$ restrictions, it is
enough to estimate $s$ parameters in terms of which $\pi_1, \ldots, \pi_{k-1}$ are
completely defined. The degrees of freedom for $\chi^2$ is $(k - 1)$ minus
$s$, the number of parameters estimated. It is necessary for the formula
of the degrees of freedom to hold that the parameters should be
estimated by the most efficient method (e.g., maximum likelihood).

In example 3 of 4c.2 the estimates of the blood-group gene frequencies
are found by the method of maximum likelihood. To test whether the
proportions in the four blood-group classes can be expressed as functions
of gene frequencies, the $\chi^2$ is calculated as below. Using the estimates
p = 0.2644, q = 0.0932, and r = 0.6424, the expected values are
derived.

         Observed    Expected                    (Observed)²/Expected
O           176      n(r²)       = 179.51             172.56
A           182      n(p² + 2pr) = 178.18             185.90
B            60      n(q² + 2qr) =  55.87              64.43
AB           17      n(2pq)      =  21.44              13.48
Total       435      n           = 435.00             436.37
                                                     −435
                                             χ² =       1.37 (1 D.F.)

The $\chi^2$ with (3 − 2) = 1 degree of freedom is not significant, thus
indicating that the cell expectations could be expressed in terms of the
gene frequencies.
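
The goodness-of-fit calculation for the blood-group data can be sketched as follows. The gene-frequency estimates are taken as given (they come from example 3 of 4c.2 and are not re-estimated here), and the dictionary layout and variable names are illustrative.

```python
observed = {"O": 176, "A": 182, "B": 60, "AB": 17}
p, q, r = 0.2644, 0.0932, 0.6424   # maximum likelihood estimates quoted in the text

n = sum(observed.values())
# Expected frequencies as functions of the gene frequencies
expected = {
    "O": n * r * r,
    "A": n * (p * p + 2 * p * r),
    "B": n * (q * q + 2 * q * r),
    "AB": n * (2 * p * q),
}

chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)
# (k - 1) independent proportions minus the 2 estimated parameters
df = (len(observed) - 1) - 2

for g in observed:
    print(f"{g:>2}  observed {observed[g]:>3}  expected {expected[g]:7.2f}")
print(f"chi-square = {chi2:.2f} on {df} degree of freedom")
```

The degrees of freedom are reduced by the number of parameters estimated only because efficient estimates are substituted; a less efficient estimate would not justify the same reference distribution.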




5b.3 Tests of Homogeneity of Parallel Samples 
Let the frequencies in $k$ classes for two samples be

                          Classes
Sample        1            2          ···        k          Total
First         n_1          n_2        ···        n_k          n
Second        n_1'         n_2'       ···        n_k'         n'
Total      n_1 + n_1'   n_2 + n_2'    ···     n_k + n_k'    n + n'

These classes may refer to a discrete classification or to intervals of a
continuous variable. Nothing being known about the actual
distributions, how can it be tested that the two samples have arisen from the
same population?

If $\pi_1, \ldots, \pi_k$ and $\pi_1', \ldots, \pi_k'$ are the cell proportions in populations
from which the samples are drawn, then the hypothesis to be tested is
$\pi_i = \pi_i'$ $(i = 1, \ldots, k - 1)$. If $\pi_i$ and $\pi_i'$ are known, then the $\chi^2$ test
of departure from the expected is
\[ \sum \frac{(n_i - n\pi_i)^2}{n\pi_i} + \sum \frac{(n_i' - n'\pi_i')^2}{n'\pi_i'} \]
which has $(k - 1) + (k - 1)$ degrees of freedom. If $\pi_i = \pi_i'$, there
are $(k - 1)$ restrictions, and the best estimate of the common value is
\[ \hat\pi_i = \hat\pi_i' = \frac{n_i + n_i'}{n + n'} \]
If this value is substituted in the above expression, $\chi^2$ reduces to
\[ \frac{1}{nn'} \sum \frac{(n_i n' - n_i' n)^2}{n_i + n_i'} \]
which has now $(k - 1)$ degrees of freedom. This tests the departure
from the equality of proportions. If
\[ p_1 = \frac{n_1}{n_1 + n_1'}, \quad p_2 = \frac{n_2}{n_2 + n_2'}, \quad \ldots \quad \text{and} \quad p = \frac{n}{n + n'} \]
then the above $\chi^2$ could also be written
\[ \frac{1}{p(1 - p)} \left[ \sum (n_i + n_i')\, p_i^2 - (n + n')\, p^2 \right] = \frac{1}{p(1 - p)} \left[ \sum n_i p_i - np \right] \]
which is convenient to calculate if the problem needs the evaluation of
the $p_i$ also.
Example 1. The distributions in four blood-group classes O, A, B,
and AB of 140 Christians who are army cadets and 295 other Christians
are given below.

TABLE 5b.3α. Blood-Group Frequencies in Two Samples of Christians

                  O         A         B         AB       Total
Army cadets       56        60        18         6        140
Others           120       122        42        11        295
Total            176       182        60        17        435
p              0.3182    0.3297    0.3000    0.3529     0.3218

Since $\sum n_i p_i$ and $np$ nearly cancel, the $p_i$ are carried to six decimal
places in the computation.

\[ \sum n_i p_i = 56(0.318182) + 60(0.329670) + 18(0.300000) + 6(0.352941) = 45.1160 \]
\[ np = 140(0.321839) = 45.0575 \]
\[ \chi^2 = \frac{45.1160 - 45.0575}{(0.3218)(0.6782)} = 0.268 \]

The probability that $\chi^2$ exceeds 0.268 with 3 degrees of freedom is
greater than 95%, showing thereby that the two samples may be
considered to have arisen from the same population.
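
The homogeneity test can be computed directly from the cell counts, which avoids the rounding of the tabulated $p_i$; the short form is sensitive to that rounding because its two terms nearly cancel. The sketch below is a minimal illustration with hypothetical function names: it evaluates the general $\sum (O - E)^2/E$ form with expected values $n_{i.} n_{.j}/n$, and the two-sample short form $[\sum n_i p_i - np]/p(1 - p)$.

```python
def homogeneity_chi2(table):
    """Chi-square and degrees of freedom for a samples-by-classes table of counts."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    chi2 = 0.0
    for i, r in enumerate(table):
        for j, o in enumerate(r):
            e = row[i] * col[j] / n          # expected value under homogeneity
            chi2 += (o - e) ** 2 / e
    df = (len(row) - 1) * (len(col) - 1)
    return chi2, df


def two_sample_short_form(first, second):
    """[sum n_i p_i - n p] / [p (1 - p)] with p_i = n_i / (n_i + n_i')."""
    n, n_prime = sum(first), sum(second)
    p = n / (n + n_prime)
    s = sum(a * a / (a + b) for a, b in zip(first, second))   # sum of n_i * p_i
    return (s - n * p) / (p * (1 - p))


cadets = [56, 60, 18, 6]       # O, A, B, AB
others = [120, 122, 42, 11]

chi2, df = homogeneity_chi2([cadets, others])
short = two_sample_short_form(cadets, others)
print(f"chi-square = {chi2:.4f} ({df} D.F.), short form = {short:.4f}")
```

The same `homogeneity_chi2` function applies unchanged to more than two samples, giving $(k - 1)(s - 1)$ degrees of freedom.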

Example 2. The test given above is not necessarily the best and is
recommended only when nothing is known about the frequency
distributions. In the above example of blood groups it is known that the
frequencies can be expressed in terms of two independent gene
frequencies (example 3, in 4c.2). If p, q and p′, q′ are the parameters appropriate
for the two samples, then the test of agreement between the two samples
reduces to that of testing the hypothesis p = p′ and q = q′. The $\chi^2$
test for hypothetical values of p, q and p′, q′ has 4 degrees of freedom,
whereas if they are estimated subject to the conditions p = p′, q = q′,
the resulting $\chi^2$ has only 2 degrees of freedom. There are only two
essential comparisons needed, and the $\chi^2$ with 2 degrees of freedom is
sufficient for this purpose. But in the example worked out above, the
$\chi^2$ test of discrepancy has 3 degrees of freedom, one more than that of
the hypothesis specifying two relationships. The $\chi^2$ test with 2 degrees
of freedom is to be preferred, and this is possible because the nature
of distributions is known.

In general, if the distributions are specified by $r$ parameters, then the
$\chi^2$ test for equality of parameters given in 5a.2 has $r$ degrees of freedom.




If the discrepancies in the $k$ classes arise owing to the parameters'
being different, then the test based on a direct comparison of the
estimated parameters appears to be reasonable.

For carrying out the proposed test it is necessary to estimate the
gene frequencies from the totals. This is worked out in example 3 of
4c.2. The estimates of p, q are
\[ \hat p = 0.26444 \qquad \hat q = 0.09317 \]
and the inverse to the information matrix (example 3 in 4c.2) per single
observation is
\[ I^{pp} = 0.114430 \qquad I^{pq} = -0.012208 \]
\[ I^{qp} = -0.012208 \qquad I^{qq} = 0.044376 \]
The scores $\phi_p$ and $\phi_q$ for the two samples and $\chi^2$ are calculated below.

Sample      n        φ_p          φ_q        χ² = (1/n) ΣΣ I^{ij} φ_i φ_j
1          140     10.37497     −7.27341            0.11704
2          295    −10.37497      7.27341            0.05554
Total      435      0            0                  0.17258

In the last column the values of $\chi^2$ differ only in the multiplier $1/n$ so
that the total $\chi^2$ can be simply obtained from the formula
\[ \left(\frac{1}{n_1} + \frac{1}{n_2}\right) \sum\sum I^{ij} \phi_i \phi_j
 = \left(\frac{1}{140} + \frac{1}{295}\right) [0.114430(10.37497)^2
 + 2(-0.012208)(10.37497)(-7.27341) + 0.044376(7.27341)^2] = 0.17258 \]

The $\chi^2$ with 2 degrees of freedom is small so that the data do not provide
any evidence for differences in gene frequencies. In fact, the probability
of exceeding the observed value with 2 degrees of freedom is just over
90%, which is smaller than the corresponding probability of example 1
above. In general, the test given in example 2 is more sensitive than
the overall test of example 1. The common estimates of p and q are
those obtained above from the totals.

The tests proposed above can be extended to the general
case of testing whether a number of samples come from the same
population. When nothing is known about the distributions except that
they are identical the $\chi^2$ is based on $(k - 1)(s - 1)$ degrees of freedom
where $k$ is the number of cells and $s$ is the number of samples. If $n_{ij}$
denotes the frequency in the $i$th cell for the $j$th sample and $n_{i.} = \sum_j n_{ij}$,
$n_{.j} = \sum_i n_{ij}$, then the expected value of $n_{ij}$ is $n_{i.} \times n_{.j}/n$, when the
hypothesis is true. The $\chi^2$ on $(k - 1)(s - 1)$ degrees of freedom is


\[ \chi^2 = \sum_i \sum_j \frac{\left(n_{ij} - \dfrac{n_{i.}\, n_{.j}}{n}\right)^2}{\dfrac{n_{i.}\, n_{.j}}{n}} \]


The test should, however, be modified if the nature of the distribution 
is known. The following example illustrates the method. 


TABLE 5b.3β. Distribution of Animals Bred for Linkage
between Two Factors A and B

Sex of                       Sex of                 Phenotype
Heterozygotes   Phase        Animals Bred     AB     Ab     aB     ab     Total
♂♂              Coupling         ♀            12     13     11      8       44
                                 ♂            13     15     16     16       60
                Repulsion        ♀            11     13     13     19       56
                                 ♂            15     10     10     16       51
♀♀              Coupling         ♀            30     17     20     13       80
                                 ♂            18     18     20     24       80
                Repulsion        ♀            17     12     13     17       59
                                 ♂            15     12     11     14       52
Total                                        131    110    114    127      482


First it is necessary to test whether there is sex difference within a
mating type. The results are from backcrosses so that the probabilities
in the four classes are $(1 - p)/2$, $p/2$, $p/2$, $(1 - p)/2$ for coupling and
$p/2$, $(1 - p)/2$, $(1 - p)/2$, $p/2$ for repulsion. The score and information
for the recombination fraction $p$ are
\[ S = -\frac{(AB)}{1 - p} + \frac{(Ab) + (aB)}{p} - \frac{(ab)}{1 - p} \quad \text{and} \quad I = \frac{n}{p(1 - p)} \]
for coupling and
\[ S = \frac{(AB) + (ab)}{p} - \frac{(Ab) + (aB)}{1 - p} \quad \text{and} \quad I = \frac{n}{p(1 - p)} \]
for repulsion. The $\chi^2$ is $S^2/I$.
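To test for homogeneity of the first two samples

♀        12     13     11      8       44
♂        13     15     16     16       60
         25     28     27     24      104

we obtain the estimate of $p$ from the totals.
\[ \hat p = \frac{28 + 27}{104} = \frac{55}{104} \]
\[ S_1 = -\frac{20}{1 - \hat p} + \frac{24}{\hat p} = \frac{(49 \times 24 - 55 \times 20)\,104}{55 \times 49} \]
\[ I_1 = \frac{44}{\hat p(1 - \hat p)} = \frac{104^2 \times 44}{55 \times 49} \]
\[ \frac{S_1^2}{I_1} = \frac{(49 \times 24 - 55 \times 20)^2}{44 \times 55 \times 49} = \frac{5776}{2695} \cdot \frac{1}{44} \]
\[ \frac{S_2^2}{I_2} = \frac{(49 \times 31 - 55 \times 29)^2}{60 \times 55 \times 49} = \frac{5776}{2695} \cdot \frac{1}{60} \]
\[ \chi^2 = \frac{S_1^2}{I_1} + \frac{S_2^2}{I_2} = \frac{5776 \times 104}{2695 \times 60 \times 44} = 0.0844 \]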


Similarly, we obtain the following four values of $\chi^2$ to test for sex
homogeneity within mating types.

                                     Probability
Mating Type       χ²       D.F.      χ² > χ₀²
♂♂ (C)          0.0844      1         >0.75
♂♂ (R)          0.5575      1         >0.45
♀♀ (C)          0.0251      1         >0.87
♀♀ (R)          0.0389      1         >0.84
Total           0.7059      4         >0.95

The total $\chi^2$, 0.7059 with 4 degrees of freedom, and the individual $\chi^2$'s
with 1 degree of freedom have very high probabilities, thus showing
remarkable agreement within mating types. The probability of
observing a $\chi^2$ less than 0.7059 with 4 degrees of freedom is less than 5%,
which shows that such good agreement can be expected very rarely.
This might lead the experimenter to suspect his material. In such a
case nothing can be said for or against the offered hypothesis. For
instance, an unconscious bias as to the nature of things to be expected
on the part of the experimenter may result in wrong recording.
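
The $S^2/I$ computation for sex homogeneity within one mating type can be sketched as below for the first pair of samples (male heterozygotes, coupling phase). The function names are illustrative. The score is written in the single form $S = (\text{recombinants})/p - (\text{parentals})/(1 - p)$, which covers both phases once the recombinant classes are identified, and $p$ is estimated from the pooled pair.

```python
from fractions import Fraction


def split(sample, phase):
    """Return (recombinants, parentals) from counts ordered AB, Ab, aB, ab."""
    outer = sample[0] + sample[3]    # (AB) + (ab)
    inner = sample[1] + sample[2]    # (Ab) + (aB)
    # In coupling the recombinant classes are Ab and aB; in repulsion AB and ab
    return (inner, outer) if phase == "coupling" else (outer, inner)


def sex_homogeneity_chi2(females, males, phase):
    """Sum of S^2 / I over the two sexes, with p estimated from their totals."""
    pairs = [split(females, phase), split(males, phase)]
    total_rec = sum(rec for rec, _ in pairs)
    total = sum(rec + par for rec, par in pairs)
    p = Fraction(total_rec, total)               # pooled recombination fraction
    chi2 = Fraction(0)
    for rec, par in pairs:
        score = rec / p - par / (1 - p)          # S
        info = (rec + par) / (p * (1 - p))       # I = n / (p (1 - p))
        chi2 += score ** 2 / info
    return chi2


chi2_first = sex_homogeneity_chi2((12, 13, 11, 8), (13, 15, 16, 16), "coupling")
print(float(chi2_first))
```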

Accepting, however, the fact that there is no sex difference within
mating types, we may proceed to test whether any difference is caused
by the mating types. The total frequencies are

               AB     Ab     aB     ab     Total       χ²
♂♂ (C)         25     28     27     24      104      0.0015
♂♂ (R)         26     23     23     35      107      0.7982
♀♀ (C)         48     35     40     37      160      2.1757
♀♀ (R)         32     24     24     31      111      0.7339

Total                                       482      3.7093

The common value of $p$ is obtained by equating the total score to zero.
\[ S_1 + S_2 + S_3 + S_4 = -\frac{25 + 24}{1 - p} + \frac{28 + 27}{p}
 - \frac{23 + 23}{1 - p} + \frac{26 + 35}{p}
 - \frac{48 + 37}{1 - p} + \frac{35 + 40}{p}
 - \frac{24 + 24}{1 - p} + \frac{32 + 31}{p} = 0 \]
or
\[ -\frac{228}{1 - p} + \frac{254}{p} = 0 \]


The values of $\chi^2$ in the last column above are calculated for the value
of $p = 254/482$. For instance, the first value is
\[ \frac{(49 \times 254 - 55 \times 228)^2}{254 \times 228 \times 104} = 0.0015 \]
The total $\chi^2$ is 3.7093, which is not significant on 3 degrees of freedom,
thus indicating close agreement of the four samples. The best estimate
of the recombination fraction is
\[ \frac{254}{482} \quad \text{or} \quad 52.7\% \]
thus indicating the possibility of the recombinants' exceeding 50%.
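
The comparison between mating types can be reproduced from the recombinant and parental totals of each mating type. This is a minimal sketch; the dictionary keys and variable names are illustrative labels, and each component is $S^2/I$ evaluated at the common estimate of $p$.

```python
# (recombinants, parentals) for each mating type, from the total frequencies:
# coupling: recombinants = Ab + aB; repulsion: recombinants = AB + ab
samples = {
    "male heterozygotes, coupling": (28 + 27, 25 + 24),
    "male heterozygotes, repulsion": (26 + 35, 23 + 23),
    "female heterozygotes, coupling": (35 + 40, 48 + 37),
    "female heterozygotes, repulsion": (32 + 31, 24 + 24),
}

total_rec = sum(rec for rec, _ in samples.values())
total_par = sum(par for _, par in samples.values())
p_hat = total_rec / (total_rec + total_par)   # root of the total score equation


def component(rec, par, p):
    """S^2 / I for one sample at recombination fraction p."""
    score = rec / p - par / (1 - p)
    info = (rec + par) / (p * (1 - p))
    return score ** 2 / info


components = {name: component(rec, par, p_hat) for name, (rec, par) in samples.items()}
total_chi2 = sum(components.values())

for name, value in components.items():
    print(f"{name:<32} {value:.4f}")
print(f"total chi-square = {total_chi2:.4f}, estimate of p = {p_hat:.3f}")
```

The total has 3 degrees of freedom because one parameter, the common $p$, is estimated from the four samples.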


5c Contingency Tables 


5c.1 The Probability of an Observed Configuration and Tests 
in Large Samples 
If the individuals of a population can be described as belonging to
one of $r$ categories, $A_1, \ldots, A_r$ with respect to an attribute $A$, and to
one of $s$ categories, $B_1, \ldots, B_s$ with respect to an attribute $B$, and so on,
then we have a frequency distribution of individuals in $r \times s \times \cdots$
classes, a typical class being represented by $A_iB_j \cdots$. If there are $k$
attributes on the total, the arrangement described above is called a
$k$-fold contingency table.

In this section various problems connected with two attributes are
discussed, the treatment being similar in the general case. Let the
observed frequency in the class $A_iB_j$ be denoted by $n_{ij}$ and the
probability by $\pi_{ij}$. Also let
\[ n_{i1} + n_{i2} + \cdots + n_{is} = n_{i.} \qquad \pi_{i1} + \pi_{i2} + \cdots + \pi_{is} = \pi_{i.} \]
\[ n_{1j} + n_{2j} + \cdots + n_{rj} = n_{.j} \qquad \pi_{1j} + \pi_{2j} + \cdots + \pi_{rj} = \pi_{.j} \]
\[ n_{1.} + n_{2.} + \cdots = n_{.1} + n_{.2} + \cdots = n_{..} \]
\[ \pi_{1.} + \pi_{2.} + \cdots = \pi_{.1} + \pi_{.2} + \cdots = 1 \]
The probability of the observed frequencies is
\[ \left[ \frac{n_{..}!}{\prod n_{i.}!} \prod \pi_{i.}^{\,n_{i.}} \right]
 \times \left[ \frac{n_{..}!}{\prod n_{.j}!} \prod \pi_{.j}^{\,n_{.j}} \right]
 \times \left[ \frac{\prod n_{i.}! \prod n_{.j}!}{n_{..}!} \prod \frac{1}{n_{ij}!} \left( \frac{\pi_{ij}}{\pi_{i.}\pi_{.j}} \right)^{n_{ij}} \right] \]
If $\pi_{ij} = \pi_{i.}\pi_{.j}$, then the above expression becomes
\[ \left[ \frac{n_{..}!}{\prod n_{i.}!} \prod \pi_{i.}^{\,n_{i.}} \right]
 \times \left[ \frac{n_{..}!}{\prod n_{.j}!} \prod \pi_{.j}^{\,n_{.j}} \right]
 \times \left[ \frac{\prod n_{i.}! \prod n_{.j}!}{n_{..}!} \prod \frac{1}{n_{ij}!} \right] \]
The first two expressions give the probability of the marginal totals,
and the third gives the probability of the class frequencies for fixed
values of the marginal totals.
\[ P(n_{ij} \mid n_{i.}, n_{.j}) = \frac{\prod n_{i.}! \prod n_{.j}!}{n_{..}!} \prod \frac{1}{n_{ij}!} \tag{5c.1.1} \]
It is interesting to observe that the above expression is independent of
the hypothetical values of the proportions, provided, of course, that
the attributes are independent, i.e., that the probability $\pi_{ij}$ is the
product of the probabilities for the $i$th category of the first attribute
and the $j$th category of the second attribute.

In some situations, especially in designed experiments, one set of
marginals is determined in advance. Thus, for instance, we might
choose a number of individuals and inoculate them against an infection.
Another chosen number of individuals could be kept as controls. Both
groups supply the number of individuals infected and not infected from
which a 2 × 2 contingency table can be set up. In general, if the row
totals are fixed in advance, then, assuming the same set of probabilities
$p_1, \ldots, p_s$ for different categories in each row, the joint probability
of the observations is
\[ P(n_{ij} \mid n_{i.}) = \prod_{i=1}^{r} \frac{n_{i.}!}{n_{i1}! \cdots n_{is}!}\, p_1^{\,n_{i1}} \cdots p_s^{\,n_{is}} \]
The probability of the marginal totals $n_{.j}$ in this case is
\[ P(n_{.j} \mid n_{i.}) = \frac{n_{..}!}{n_{.1}! \cdots n_{.s}!}\, p_1^{\,n_{.1}} \cdots p_s^{\,n_{.s}} \]
which is obtained by summing the previous expression over all $n_{ij}$ with
$n_{.j} = \sum_i n_{ij}$ for $j = 1, \ldots, s$. Hence
\[ P(n_{ij} \mid n_{i.}, n_{.j}) = \frac{P(n_{ij} \mid n_{i.})}{P(n_{.j} \mid n_{i.})}
 = \frac{\prod n_{i.}! \prod n_{.j}!}{n_{..}!} \prod \frac{1}{n_{ij}!} \]
which is the same as the expression (5c.1.1) obtained in the general case.
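
The conditional probability of a table for fixed marginal totals can be evaluated exactly from the factorial expression. The sketch below is illustrative: the function name and the small 2 × 2 example (row totals 3 and 2, column totals 2 and 3) are assumptions chosen so that all tables with those margins can be listed by hand.

```python
from fractions import Fraction
from math import factorial, prod


def table_probability(table):
    """Probability of the cell frequencies given both sets of marginal totals."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    numerator = prod(factorial(m) for m in rows + cols)
    denominator = factorial(n) * prod(factorial(x) for r in table for x in r)
    return Fraction(numerator, denominator)


# All 2 x 2 tables with row totals (3, 2) and column totals (2, 3)
tables = [[[a, 3 - a], [2 - a, a]] for a in range(3)]
probabilities = [table_probability(t) for t in tables]

for t, pr in zip(tables, probabilities):
    print(t, pr)
print("sum over all tables with these margins:", sum(probabilities))
```

No hypothetical proportions enter the calculation, in line with the remark that the expression is free of them when the attributes are independent.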


5c.2 Tests of Independence in a Contingency Table 
If the probabilities $\pi_{ij}$ of the cells in a contingency table are assigned,
then to test the hypothesis that the data are in agreement with these
hypothetical values the statistic
\[ \chi^2 = \sum\sum \frac{(n_{ij} - n_{..}\pi_{ij})^2}{n_{..}\pi_{ij}} = \sum \frac{(O - E)^2}{E} \quad \text{(over all classes)} \]
can be used as $\chi^2$ with $(rs - 1)$ degrees of freedom, the only restriction
being $\sum\sum n_{ij} = n_{..}$. If the attributes are independent, then the cell
probabilities satisfy the relations
\[ \pi_{ij} = \pi_{i.}\pi_{.j} \qquad \text{for all } i \text{ and } j \]
How can this hypothesis be tested on the basis of the observed data?
Two situations may arise.

(i) The hypothetical probabilities $\pi_{i.}$ and $\pi_{.j}$ specifying the marginal
distributions may be known, in which case we are required to examine
whether the cell probabilities could be constructed by the above law.

(ii) The hypothetical proportions of the marginal frequencies not
being known, we are required to test whether the attributes are
independent.
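In the first problem we have the total

$$\chi^2 = \sum\sum \frac{(n_{ij} - n_{\cdot\cdot}\pi_{i\cdot}\pi_{\cdot j})^2}{n_{\cdot\cdot}\pi_{i\cdot}\pi_{\cdot j}}, \qquad \text{D.F.} = rs - 1$$

which measures the overall discrepancy of the observed from the expected.
From this we can single out two components

$$\chi_1^2 = \sum_i \frac{(n_{i\cdot} - n_{\cdot\cdot}\pi_{i\cdot})^2}{n_{\cdot\cdot}\pi_{i\cdot}}, \qquad \text{D.F.} = r - 1$$

$$\chi_2^2 = \sum_j \frac{(n_{\cdot j} - n_{\cdot\cdot}\pi_{\cdot j})^2}{n_{\cdot\cdot}\pi_{\cdot j}}, \qquad \text{D.F.} = s - 1$$

which measure the discrepancy of the observed marginal frequencies
from the expected. With these statistics we can test whether the observed
marginal totals are as expected. On subtracting $\chi_1^2$ and $\chi_2^2$ from
the total, we obtain

$$\chi_3^2 = \chi^2 - \chi_1^2 - \chi_2^2 = \sum\sum \frac{\{n_{ij} - n_{\cdot\cdot}\pi_{i\cdot}\pi_{\cdot j} - \pi_{\cdot j}(n_{i\cdot} - n_{\cdot\cdot}\pi_{i\cdot}) - \pi_{i\cdot}(n_{\cdot j} - n_{\cdot\cdot}\pi_{\cdot j})\}^2}{n_{\cdot\cdot}\pi_{i\cdot}\pi_{\cdot j}}$$

It may be noted that $\chi_3^2$ is equal to $\chi^2$, $\chi_1^2$ and $\chi_2^2$ being zero, when the
frequencies are subject to the restrictions

$$n_{i\cdot} - n_{\cdot\cdot}\pi_{i\cdot} = 0, \qquad i = 1, \cdots, r$$

out of which $(r - 1)$ are independent, and

$$n_{\cdot j} - n_{\cdot\cdot}\pi_{\cdot j} = 0, \qquad j = 1, \cdots, s$$

out of which $(s - 1)$ are independent. The degrees of freedom for
$\chi^2$ when the frequencies are subjected to $(r - 1) + (s - 1) + 1$ (for
the restriction $\sum\sum n_{ij} = n_{\cdot\cdot}$) restrictions are $(rs - r - s + 1) =
(r - 1)(s - 1)$. Therefore $\chi_3^2$ is distributed as $\chi^2$ with $(r - 1)(s - 1)$
degrees of freedom. This component is used to test the departure from
independence.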

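As an example, consider the fourfold contingency table

              B1        B2
    A1        a         b         a + b
    A2        c         d         c + d
              a + c     b + d     n

with the marginal hypothetical proportions $(p_1, q_1)$ for A and $(p_2, q_2)$ for
B. The total $\chi^2$ with 3 degrees of freedom is

$$\frac{(a - np_1p_2)^2}{np_1p_2} + \frac{(b - np_1q_2)^2}{np_1q_2} + \frac{(c - nq_1p_2)^2}{nq_1p_2} + \frac{(d - nq_1q_2)^2}{nq_1q_2}$$

The components are

$$\chi_1^2 = \frac{(a + b - np_1)^2}{np_1} + \frac{(c + d - nq_1)^2}{nq_1} = \frac{(a + b - np_1)^2}{np_1q_1}$$

$$\chi_2^2 = \frac{(a + c - np_2)^2}{np_2} + \frac{(b + d - nq_2)^2}{nq_2} = \frac{(a + c - np_2)^2}{np_2q_2}$$

$$\chi_3^2 = \sum \frac{\{a - np_1p_2 - p_2(a + b - np_1) - p_1(a + c - np_2)\}^2}{np_1p_2} = \frac{(aq_1q_2 - bp_2q_1 - cp_1q_2 + dp_1p_2)^2}{np_1p_2q_1q_2}$$

with 1 degree of freedom each, the sum in $\chi_3^2$ being taken over the four cells.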

Example 1. Bateson found the following distribution of sweet pea
plants obtained from an intercross so that the marginal frequencies are
expected to be in the ratio 3:1. If the two characters, flower color and
pollen shape, are independently inherited, then the cell frequencies are
expected to be in the ratio 9:3:3:1.

                         Flower Color
    Pollen Shape      Purple      Red      Total
    Long                296        27       323
    Round                19        85       104
    Total               315       112       427


It was seen (example 1, 5b.1) that the total $\chi^2$ of discrepancy with 3
degrees of freedom is 222.1221, which is very high. The first component
is

$$\chi_1^2 = \frac{(a + b - np_1)^2}{np_1q_1} = \frac{(323 - \tfrac{3}{4} \times 427)^2}{427 \times \tfrac{3}{4} \times \tfrac{1}{4}} = 0.0945$$

which is quite small for 1 degree of freedom, showing that the single
factor segregation for pollen shape is as expected. The second component is

$$\chi_2^2 = \frac{(a + c - np_2)^2}{np_2q_2} = \frac{(315 - \tfrac{3}{4} \times 427)^2}{427 \times \tfrac{3}{4} \times \tfrac{1}{4}} = 0.3443$$

which again is quite small. The third component is

$$\chi_3^2 = \frac{(aq_1q_2 - bp_2q_1 - cp_1q_2 + dp_1p_2)^2}{np_1p_2q_1q_2} = \frac{(296 - 27 \times 3 - 19 \times 3 + 85 \times 9)^2}{427 \times 3 \times 3 \times 1 \times 1} = 221.6833$$

which is very high for 1 degree of freedom. The total $\chi^2$ is

    0.0945 + 0.3443 + 221.6833 = 222.1221

which agrees with the total calculated earlier. It is seen that the
total $\chi^2$ is concentrated in one component with a single degree
of freedom. This shows that the departure of the observed from the
expected cell frequencies is due to the characters not being independently
inherited but not to single factor segregations. The success of statistical
tests lies in isolating such components which are most efficient
for judging the points at issue.
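
The partition in example 1 can be checked numerically. The Python sketch below is an illustration added for that purpose, not part of the original computation; the function name `chi_square_components` and its argument order are assumptions. It evaluates the total $\chi^2$ and the three single-degree-of-freedom components for a fourfold table with assigned marginal proportions.

```python
def chi_square_components(a, b, c, d, p1, p2):
    """Total chi-square of a fourfold table and its three 1-d.f. components.

    p1, p2 are the assigned proportions for the first row and first column
    (hypothetical argument names); q1 = 1 - p1 and q2 = 1 - p2.
    """
    n = a + b + c + d
    q1, q2 = 1.0 - p1, 1.0 - p2
    expected = (n * p1 * p2, n * p1 * q2, n * q1 * p2, n * q1 * q2)
    total = sum((o - e) ** 2 / e for o, e in zip((a, b, c, d), expected))
    chi1 = (a + b - n * p1) ** 2 / (n * p1 * q1)    # row margin
    chi2 = (a + c - n * p2) ** 2 / (n * p2 * q2)    # column margin
    chi3 = (a * q1 * q2 - b * p2 * q1 - c * p1 * q2 + d * p1 * p2) ** 2 / (
        n * p1 * p2 * q1 * q2)                      # departure from independence
    return total, chi1, chi2, chi3


# Bateson's sweet pea data with 3:1 segregation assumed on both margins.
total, chi1, chi2, chi3 = chi_square_components(296, 27, 19, 85, 0.75, 0.75)
print(round(chi1, 4), round(chi2, 4), round(chi3, 4))
```

The three components returned by the function add up to the total, which is the identity on which the partition rests.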

Suppose the hypothetical values of the marginal probabilities are not
known. Then we estimate their values on the hypothesis

$$\pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$$

The best estimates are

$$\pi_{i\cdot} = \frac{n_{i\cdot}}{n_{\cdot\cdot}} \qquad \text{and} \qquad \pi_{\cdot j} = \frac{n_{\cdot j}}{n_{\cdot\cdot}}$$

Since $(r - 1) + (s - 1)$ parameters have been estimated, the above $\chi^2$
has $(rs - 1) - (r - 1) - (s - 1) = (r - 1)(s - 1)$ degrees of freedom.
At the estimated values the components $\chi_1^2$ and $\chi_2^2$ have zero values
so that $\chi_3^2 = \chi^2$. Thus $\chi_3^2$ measures the departure from independence.

This test is useful in two situations: 

(i) Suppose that in example 1 above it is found that the marginal
frequencies deviate significantly from the expected. This indicates that
the assigned marginal probabilities may not be correct. In fact, if the
single factor segregations are disturbed owing to unequal viability of
the two types of plants, then the marginal frequencies will not be in the
ratio 3:1. In such a case the third component $\chi_3^2$ loses its importance,
or, in other words, the significance of $\chi_3^2$ may be due to the use of
wrong proportions. The best course is then to substitute the estimated
proportions and use the test obtained above.

(ii) The second situation is when nothing is specified about the
marginal proportions. In this case only a test for independence is
possible.

It may be observed that the test of independence in a contingency 
table is the same as the test of homogeneity in parallel samples described 
in 5b.3. 

If the hypothetical proportions are not known in the above example,
then the $\chi^2$ for testing independence is

$$\frac{\{a - (a + b)(a + c)/n\}^2}{(a + b)(a + c)/n} + \frac{\{b - (a + b)(b + d)/n\}^2}{(a + b)(b + d)/n} + \cdots$$

which reduces to

$$\frac{n(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)} = \frac{427(296 \times 85 - 27 \times 19)^2}{315 \times 112 \times 323 \times 104} = 269.3095$$

This gives a $\chi^2$ higher than that obtained by using the hypothetical
values of the marginal proportions. Such discrepancies will not in
general lead to contradictory conclusions. The earlier test makes use
of the information supplied by a total of 427 plants in testing for independence,
whereas the latter makes use of the information supplied
by the set of configurations having the same marginal totals. Some
marginal totals, as in the present case, are more informative than the
average, whereas others are less.

Example 2. The following data give the number of skulls excavated
in three different seasons and the sex distribution as sexed by investigator
A working in the first two seasons and by B working in the third
season.

                      Seasons
              First    Second    Third    Total
    ♂          162       180      210      552
    ♀          110       125      200      435
    Total      272       305      410      987


Let us assume that in each season the excess of males is due to a random
deviation from the expected equal numbers for the two sexes. The expected
values are

    ♂    136    152.5    205
    ♀    136    152.5    205

giving a total of $\chi^2$ = 9.9412 + 9.9180 + 0.2440 = 20.1032, a high
value for 3 degrees of freedom. Individually the deviations in the
first two seasons ($\chi^2$ = 9.9412 and 9.9180) are significant. If the sex
ratio is 1:1 on the total, the $\chi^2$ resulting from the marginal totals is

$$\frac{(552)^2}{493.5} + \frac{(435)^2}{493.5} - 987 = 13.8692$$

which leaves $\chi^2$ = 20.1032 − 13.8692 = 6.2340 with 2 degrees of freedom
for testing the differences in the sex ratio in the three seasons.
This is no doubt significant, showing differences in sex ratio, but the
test is not strictly valid owing to the fact that the marginal totals are
not compatible with the sex ratio 1:1, the $\chi^2$ = 13.8692 being significant
for 1 degree of freedom. Having observed that there is an overall
discrepancy from the sex ratio 1:1 in all three seasons put together, we
might ask whether the sex ratio is the same for the three seasons although
it may not be 1:1. A straight test of independence for fixed marginals
or homogeneity of parallel samples (5b.3) can now be calculated. This
gives $\chi^2$ equal to 6.3222 with 2 degrees of freedom, showing significant
differences in sex ratios in the three seasons. The agreement of this
$\chi^2$ with the earlier value of 6.2340 obtained by using the hypothetical
ratios is, perhaps, accidental.
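
The figures quoted for the skull data can be reproduced with a few lines of Python. This is only a sketch added for illustration; the helper name `independence_chi_square` is an assumption. It computes the total $\chi^2$ on the 1:1 hypothesis, the part due to the marginal totals, and the ordinary $\chi^2$ of independence with marginals estimated from the data.

```python
def independence_chi_square(table):
    """Pearson chi-square of independence, margins estimated from the data."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


skulls = [[162, 180, 210],   # males in the first, second, third season
          [110, 125, 200]]   # females in the first, second, third season

season_totals = [sum(col) for col in zip(*skulls)]
n = sum(season_totals)

# Total chi-square on the 1:1 hypothesis: each cell expects half its season total.
chi2_total = sum((obs - tot / 2) ** 2 / (tot / 2)
                 for row in skulls for obs, tot in zip(row, season_totals))

# Component due to the overall sex totals departing from 1:1.
chi2_margin = sum((sum(row) - n / 2) ** 2 / (n / 2) for row in skulls)

# Straight test of independence (homogeneity of the three seasons).
chi2_seasons, dof_seasons = independence_chi_square(skulls)
print(round(chi2_total, 4), round(chi2_margin, 4), round(chi2_seasons, 4))
```

The last figure may differ from the quoted 6.3222 in the fourth decimal place, since the hand computation carries rounded intermediate values.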

One must be careful in drawing conclusions from data of this nature. 
It must be observed that the skulls are sexed by a subjective method 
of anatomical appreciation. Different investigators have different meth- 
ods of sexing, leading to different ratios. The observed discrepancy in 
the sex ratios for different seasons may be due to the investigator in the 
third season being different. The observed proportions are 


Season First Second Third Overall 
Proportion 0.5956 0.5902 0.5122 0.5593 


The same investigator working in the first two seasons shows a smaller
proportion in the second though not significantly different from the
first, the $\chi^2$ with 1 degree of freedom being 0.0175. But it may just as
well be that he thought that his method gave an excess of males in the
first season and tried to alter his method consciously or unconsciously
in the second season. The discrepancy between the investigators is then
tested by the $\chi^2$ test of independence from the table:

            First and Second Seasons      Third Season
    ♂                 342                     210
    ♀                 235                     200


The $\chi^2$ with 1 degree of freedom is 6.3055, which is significant. Thus,
out of a total $\chi^2$ of 6.3222 with 2 degrees of freedom, measuring the
differences in sex ratios, 1 degree alone accounts for 6.3055, which shows
that the whole discrepancy between seasons is due only to the discrepancy
between investigators. This might indicate a difference in the
method of sexing or that the investigators' proportions referred to different
strata from which the skulls are excavated and there might be
stratum differences.

It should also be observed that the deviation of the overall sex ratio
from 1:1 may be due to a wrong technique of sexing.

Example 3. Consider the following data, collected from a number
of schools, regarding speech defects (S1, S2, S3) and physical defects
(P1, P2, P3) of school children.

               S1      S2      S3      Total
    P1         45      26      12        83
    P2         32      50      21       103
    P3          4      10      17        31
    Total      81      86      50       217

The expected values on the hypothesis of independence are:

    30.982    32.894    19.124
    38.447    40.820    23.733
    11.571    12.286     7.143

The $\chi^2$ with 2 × 2 = 4 degrees of freedom is 32.8828, which is significant.
It is seen that, although the frequency in one cell is as small as 4, the
expected is large enough for the $\chi^2$ approximation to hold. But the
frequency also should be large enough for the approximation to be good.
In such cases it is reasonable to combine two cells by adding their
frequencies and treat them as one cell for purposes of tests of significance.
In the above example 4 and 10 may be added to yield an observed
frequency of 14 with the corresponding expected 11.571 + 12.286 =
23.857. The new $\chi^2$ is 31.5763. Although the summation is now taken
over one cell less, theoretically 1 degree of freedom is not lost. So to
use the new $\chi^2$ with (4 − 1) = 3 degrees of freedom is to overestimate
significance, whereas to consider it as with 4 degrees of freedom is to
underestimate significance. Although definite conclusions can be drawn
either when the new $\chi^2$ is not significant for 3 degrees of freedom or when
it is significant for 4 degrees of freedom as in the present case, it is not
possible to say anything when the new $\chi^2$ is significant for 3 degrees
and not for 4 degrees of freedom. In such a situation, for the purpose
of the $\chi^2$ test a new set of expected values may be obtained by considering
the first two cells of P3 as constituting a single cell. If $p_1, q_1, r_1$ and
$p_2, q_2, r_2$ are the marginal proportions for physical defects and speech
defects, then the probability of the observed frequencies on the hypothesis
of independence is

$$\text{const.}\;(p_1p_2)^{45}(p_1q_2)^{26}(p_1r_2)^{12}(q_1p_2)^{32}(q_1q_2)^{50}(q_1r_2)^{21}\{(p_2 + q_2)r_1\}^{14}(r_1r_2)^{17}$$


The maximum likelihood estimates are

    $217p_1 = 83$      $217q_1 = 103$      $217r_1 = 31$
    $217r_2 = 50$      $76p_2 = 77q_2$

which give

    $p_2 = 0.387308$      $q_2 = 0.382278$      $r_2 = 0.230414$

The estimates of $p_1, q_1, r_1$ are the same as before. The new expectations
are

    32.147    31.729    19.124
    39.893    39.375    23.733
    (12.006 + 11.851)    7.143

and the $\chi^2$ with (7 − 4) degrees of freedom is 31.2472. This test is
valid in the sense that when significance is noted the hypothesis of
independence is rejected.

Fisher recommended a test based on likelihood, which is more appropriate
when the cell frequencies are small. This is defined by

$$L = 2\sum O \log_e \frac{O}{E}$$

which in tests of independence in a contingency table can be written

$$L = 2\left\{\sum\sum n_{ij}\log_e n_{ij} - \sum n_{i\cdot}\log_e n_{i\cdot} - \sum n_{\cdot j}\log_e n_{\cdot j} + n_{\cdot\cdot}\log_e n_{\cdot\cdot}\right\}$$

In this case L is approximately distributed as $\chi^2$ with $(r - 1)(s - 1)$
degrees of freedom. The value of L in the above problem is 30.4448,
which is significant on 4 degrees of freedom. Even the use of L requires
a large sample, in which case $\chi^2$ and L tend to equivalence and there is
no theoretical justification of one in preference to the other. In small 
samples the statistic L may be more appropriate, but it cannot be used 
unless its distribution is known. Therefore some such technique as 
that followed above may be used. It must be emphasized that the 
object of the test is first to establish departure from independence in a 
general way. For this it is enough to use a valid test which is simple 
to compute. Afterwards more refined tests may be used to examine 
some portions of the contingency table. For instance, we may inquire 
whether the two physical defects P and P} and the speech defects Sı 
and Sg are associated. This needs a refined technique discussed in the 
next section. 
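
The likelihood statistic L for example 3 can be evaluated directly from its contingency-table form. The Python sketch below is illustrative only; the function name is an assumption, and the cell counts are those of example 3 with any entry not legible in the printed table recovered from the marginal totals.

```python
from math import log


def likelihood_ratio_statistic(table):
    """L = 2{sum n_ij log n_ij - sum n_i. log n_i. - sum n_.j log n_.j + n.. log n..}.

    Empty cells contribute nothing; all marginal totals are assumed positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cells = sum(x * log(x) for row in table for x in row if x > 0)
    rows = sum(x * log(x) for x in row_totals)
    cols = sum(x * log(x) for x in col_totals)
    return 2.0 * (cells - rows - cols + n * log(n))


# Physical defects (rows P1, P2, P3) by speech defects (columns S1, S2, S3).
defects = [[45, 26, 12],
           [32, 50, 21],
           [4, 10, 17]]
L = likelihood_ratio_statistic(defects)
print(round(L, 4))
```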


5c.3 Tests of Independence in Small Samples 


In testing for any hypothesis specifying some relations satisfied by
the parameters it is seen that the exact values of the parameters estimated
do not enter into the large sample distribution of the $\chi^2$ statistic.
But in small samples it might happen that the $\chi^2$ approximation breaks
down and/or the unknown parameters appear in the exact distribution
of $\chi^2$. In the latter case no exact test of significance is possible, owing
to the presence of the unknown parameters in the probability distribution.
Such unknown parameters are called nuisance parameters.

One way of getting rid of the nuisance parameters is to compare the
particular observed sample, not with the whole population of samples
with which a comparison might be made if the exact values of the
nuisance parameters were known, but with a subpopulation selected
with reference to the sample in such a way that the distribution of a
statistic in this subpopulation does not involve any unknown parameter
(Bartlett, 1937; Hotelling, 1940). For instance, it is shown that on the
hypothesis of independence of two attributes the probability of cell
frequencies, given the marginal totals, is

$$P(n_{ij} \mid n_{i\cdot}, n_{\cdot j}) = \frac{\prod n_{i\cdot}!\,\prod n_{\cdot j}!}{n_{\cdot\cdot}!}\;\frac{1}{\prod\prod n_{ij}!}$$

which does not contain the hypothetical values $\pi_{i\cdot}$ and $\pi_{\cdot j}$. The distribution
of $\chi^2$ may then be found, using the conditional distribution of
the cell frequencies. This admits the possibility of determining the
exact probabilities in tests of independence.

Consider the fourfold table with frequencies a, b, c, d. The probability
of the observed configuration, given the marginal totals, is

$$\frac{(a + b)!\,(a + c)!\,(b + d)!\,(c + d)!}{a!\,b!\,c!\,d!\,n!} \tag{5c.3.1}$$

If a, b, c, d are considered as four independent Poisson variates having
the joint probability

$$e^{-(m_1 + m_2 + m_3 + m_4)}\;\frac{m_1^{a}\,m_2^{b}\,m_3^{c}\,m_4^{d}}{a!\,b!\,c!\,d!} \tag{5c.3.2}$$

where

$$m_1 = \frac{(a + b)(a + c)}{n}, \quad m_2 = \frac{(b + d)(a + b)}{n}, \quad m_3 = \frac{(a + c)(c + d)}{n}, \quad m_4 = \frac{(b + d)(c + d)}{n}$$

then the probability of the totals (a + b), (a + c), (b + d), (c + d) is

$$e^{-(m_1 + m_2 + m_3 + m_4)}\;\frac{(a + b)^{a+b}(a + c)^{a+c}(b + d)^{b+d}(c + d)^{c+d}}{n^{n}}\;\sum \frac{1}{a!\,b!\,c!\,d!} \tag{5c.3.3}$$

where the summation is taken over constant values of (a + b), (a + c),
(b + d), (c + d). Since

$$\sum \frac{1}{a!\,b!\,c!\,d!} = \frac{n!}{(a + b)!\,(a + c)!\,(b + d)!\,(c + d)!}$$

the probability of a, b, c, d for given values of (a + b), (a + c), (b + d),
(c + d) is the ratio of the expression (5c.3.2) to (5c.3.3),

$$\frac{(a + b)!\,(a + c)!\,(b + d)!\,(c + d)!}{n!\,a!\,b!\,c!\,d!}$$

which is the same as the expression (5c.3.1). Thus the probability of
a given configuration in a contingency table for given marginal totals
may be considered as a relative probability of four Poisson variates
subject to three independent restrictions. If the values of $m_1, m_2, m_3,
m_4$ or the expected cell frequencies are large, the Poisson distributions
tend to normality, in which case the $\chi^2$ statistic is distributed as $\chi^2$
with 1 degree of freedom. On the other hand, if the expectations are
small the continuous distribution of $\chi^2$ cannot be used. In such a case
a direct approach is to calculate the sum of the probabilities of the
observed and the less probable configurations and to reject the hypothesis
if this sum is small (either below 5 or 1%). The following illustrations
explain the method.

Example 1. Do the following data on sociability (S) and nonsociability
(NS) of soldiers recruited in cities (C) and villages (V) suggest
that city soldiers are more sociable than village soldiers?

               S      NS     Total
    C          13      4      17
    V           6     14      20
    Total      19     18      37

The smaller frequencies in one diagonal suggest that city soldiers are
more sociable. But it must be ascertained whether such a configuration
as the observed and those indicating a higher degree of sociability
of the city recruits can occur by chance if in fact there was no difference
in the sociabilities of the city and the village recruits. Since for fixed
marginals the probability of a given configuration a, b, c, d in the four
cells is

$$\frac{17!\,20!\,19!\,18!}{37!}\;\frac{1}{a!\,b!\,c!\,d!}$$

we find that the probabilities for configurations with 4, 3, 2, 1, and 0
in the north-east corner cell (these being less favorable to the hypothesis
of independence and more to the alternative suggested) are, respectively,
0.005220, 0.0005966, 0.00003728, 0.000001097, and 0.00000001075, adding up to
0.0059. The chances are very small, thus indicating that city soldiers
are more sociable than village soldiers.
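
The tail sum described above is easy to reproduce. The following Python sketch is an added illustration, not the author's computation; the function names are assumptions. It evaluates the probability of a fourfold table for fixed marginals and adds the probabilities of the observed table and of those with a smaller north-east cell.

```python
from math import comb


def table_probability(a, b, c, d):
    """P(a, b, c, d | marginal totals) for a fourfold table.

    Equal to (a+b)!(a+c)!(b+d)!(c+d)! / (n! a! b! c! d!), written with
    binomial coefficients to keep the arithmetic in integers.
    """
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def one_sided_tail(a, b, c, d):
    """Observed table plus all tables with a smaller north-east cell b."""
    total = 0.0
    while b >= 0 and c >= 0:
        total += table_probability(a, b, c, d)
        # Move one unit out of b and c into a and d; the margins stay fixed.
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return total


p_one_sided = one_sided_tail(13, 4, 6, 14)
print(round(p_one_sided, 4))
```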

If the cell frequencies are not small, this result could be established
by calculating $\chi^2$ for testing independence and determining the probability
of a normal deviate with zero mean and unit variance exceeding
$\chi$ (example 2 in 5b.1). In the above case

$$\chi^2 = \frac{37(13 \times 14 - 6 \times 4)^2}{19 \times 18 \times 17 \times 20} = 7.9435, \qquad \chi = \sqrt{7.9435} = 2.8181$$

so that the normal probability is 0.0025, which is smaller than the actual
value 0.0059, the discrepancy being due to smallness of the sample.

Yates suggested that by calculating $\chi^2$ from a table obtained by increasing
the smaller frequency * by ½ without altering the marginal
totals a closer approximation to the actual probability is realized. In
the present example, the new $\chi^2$, said to be corrected for continuity,
is 6.1922. The value of $\chi$ = 2.4884, so that the normal probability is
0.0064, which is closer to the actual value than in the case of the uncorrected
$\chi^2$.


A slightly different method suggested by V. M. Dandekar involves
the calculation of $\chi_0^2$, $\chi_{-1}^2$, and $\chi_1^2$ for the observed configuration and
those obtained by increasing and decreasing the smallest frequency by
unity. From these a corrected $\chi^2$ can be obtained from the formula

$$\chi_c^2 = \chi_0^2 - \frac{\chi_0^2 - \chi_{-1}^2}{\chi_1^2 - \chi_{-1}^2}\,(\chi_1^2 - \chi_0^2)$$

* The reference is to the smaller frequency in the diagonal (6, 4) under consideration
indicating nonsociability of the village recruits. In the general test of independence
½ is to be added to a frequency so as to obtain a smaller $\chi^2$.

In the present example $\chi_0^2$ = 7.9435, $\chi_1^2$ = 12.0995, $\chi_{-1}^2$ = 4.6587, and

$$\chi_c^2 = 7.9435 - \frac{7.9435 - 4.6587}{12.0995 - 4.6587}\,(12.0995 - 7.9435) = 6.1086, \qquad \sqrt{\chi_c^2} = 2.4715$$


The normal probability is 0.0068, which is also close to the actual value. 
The likelihood test

$$L = 2\sum O \log_e \frac{O}{E}$$

in this case gives the value 8.2811. The value of $\chi$ is $\sqrt{8.2811}$ = 2.8778
so that the probability is much smaller than the actual value. Thus
the likelihood test does not improve the situation.
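
The uncorrected and the two corrected values of $\chi^2$ can be compared with a short Python sketch. It is an illustration only; the function names are assumptions, Yates' correction is written in the equivalent form that reduces |ad − bc| by n/2, and the Dandekar function covers only the one-sided case of this example, in which the smallest frequency is the north-east cell b.

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected chi-square n(ad - bc)^2 / product of the four margins."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def yates_chi_square(a, b, c, d):
    """Chi-square after moving the cells 1/2 toward expectation (Yates)."""
    n = a + b + c + d
    shrunk = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * shrunk ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def dandekar_chi_square(a, b, c, d):
    """Dandekar's corrected chi-square, assuming b is the smallest frequency."""
    chi_0 = chi_square_2x2(a, b, c, d)
    chi_plus = chi_square_2x2(a + 1, b - 1, c - 1, d + 1)    # b decreased by 1
    chi_minus = chi_square_2x2(a - 1, b + 1, c + 1, d - 1)   # b increased by 1
    return chi_0 - (chi_0 - chi_minus) / (chi_plus - chi_minus) * (chi_plus - chi_0)


observed = (13, 4, 6, 14)
print(round(chi_square_2x2(*observed), 4),
      round(yates_chi_square(*observed), 4),
      round(dandekar_chi_square(*observed), 4))
```

Dandekar's value may differ from the quoted 6.1086 in the fourth decimal place because the hand computation uses the three $\chi^2$ values rounded to four places.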

Example 2. In example 1, the object of investigation was to study 
whether city soldiers are more sociable than village soldiers. This 
necessitated consideration of the deviations from the expected in one 
way only. But, in general, if the object is to discover association be- 
tween two attributes without specifying the nature of the association, 
it is necessary to consider all possible deviations from the expected. 
Thus, if a plant can be classified with respect to one of two flower colors 
and one of two pollen shapes, we pose the question whether the pollen 
shape and flower color are independent. In such a case we have to 
determine the total probability of the observed configuration and those 
less probable than this. Let us consider the same data as in example 1. 
The configurations less probable than the observed and indicating asso- 
ciation in one way have the values 3, 2, 1, and 0 in the north-east cell 
of the table. The probabilities of these configurations and the observed 
have been calculated in that example, and they add up to 0.0059. The 
configurations less probable than the observed but indicating association 
the other way are those which have 4, 3, 2, 1, and 0 in the north-west
corner and their probabilities are, respectively, 0.002088, 0.0001864,
0.000008773, 0.0000001828, and 0.000000001132, adding up to 0.0023. The total
probability is then 0.0059 + 0.0023 = 0.0082, which is small, thus indicating
departure from independence.
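
The two-sided total can be obtained by enumerating every table with the given marginals and adding the probabilities of those no more probable than the observed one. The Python sketch below is illustrative; the function names and the small relative tolerance used when comparing probabilities are assumptions of the sketch.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of a fourfold table given its marginal totals."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def two_sided_probability(a, b, c, d):
    """Sum of probabilities of all tables no more probable than the observed."""
    p_observed = table_probability(a, b, c, d)
    row1, col1, n = a + b, a + c, a + b + c + d
    total = 0.0
    # x runs over every admissible value of the north-west cell.
    for x in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if p <= p_observed * (1 + 1e-9):   # tolerance guards against rounding
            total += p
    return total


p_two_sided = two_sided_probability(13, 4, 6, 14)
print(round(p_two_sided, 4))
```

The unrounded total is a little below the quoted 0.0082, which is the sum of the two tail probabilities after each has been rounded to four places.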

If the sample is large, we can find the probability by directly entering
the uncorrected $\chi^2$ in a $\chi^2$ table with 1 degree of freedom. In the present
example $\chi^2$ = 7.9435 with the associated probability about 0.005, which
is smaller than the actual value 0.0082.

Using Yates' correction for continuity, the $\chi^2$ = 6.1922, with the
corresponding probability 0.0128, which is higher than the actual value.

To extend Dandekar's correction to this case we first note that the
$\chi^2$ values below and above the observed $\chi^2$ = 7.9435 are 6.0598 and
9.7448, corresponding to the partitions

     5   12              4   13
    14    6     and     15    5

The corrected $\chi^2$ is

$$\chi_c^2 = 7.9435 - \frac{7.9435 - 6.0598}{9.7448 - 6.0598}\,(9.7448 - 7.9435) = 7.0228$$

which gives a probability 0.0082, almost exactly equal to the actual
value. In general, Dandekar's correction is slightly better than that
of Yates, although the correction is simpler in Yates' method. In
testing for linkage on the basis of data classified according to two
factors, it is enough to test for association one way if it is known that
the recombination fraction is less than ½. It is now known that the
recombination fraction can exceed ½, as demonstrated by Fisher. So
it is better first to disprove the hypothesis of independence without
inquiring as to the nature of association. Further it must be noted
that departure from independence may occur owing to various causes
in experimental data, and it is better to use a test which gives a direct
appraisal of the data as to its compatibility with the hypothesis of
independence.

5d Tests in Poisson Populations 


It was shown in 2a.3 that the probability of k independent Poisson
variates can be written as the product of

$$P(X_1 + \cdots + X_k) = e^{-\Sigma\mu}\;\frac{(\Sigma\mu)^{\Sigma X}}{(\Sigma X)!}$$

and

$$P(X_1, \cdots, X_k \mid \Sigma X) = \frac{(\Sigma X)!}{X_1! \cdots X_k!}\left(\frac{\mu_1}{\Sigma\mu}\right)^{X_1} \cdots \left(\frac{\mu_k}{\Sigma\mu}\right)^{X_k}$$

the latter being the multinomial probability. Testing whether the observations
could have arisen from the same Poisson population is equivalent to
testing whether in a series of $(X_1 + \cdots + X_k)$ trials the frequencies
$X_1, \cdots, X_k$ could arise from a multinomial distribution with equal probabilities
in the k classes, since $(\mu_i/\Sigma\mu) = 1/k$ when the means are equal. For
instance, to test whether the observations 6, 9, ... could have arisen from
the same Poisson population the test explained earlier
could be carried out. If the individual values are not small, then the
$\chi^2$ test of departure from expected values could be used. The expected
value is $\bar{X} = \Sigma X/k$, corresponding to each $X_i$, so that

$$\chi^2 = \sum \frac{(X_i - \bar{X})^2}{\bar{X}}$$

This has $(k - 1)$ degrees of freedom. We can also test any hypothesis
specifying the proportions $\lambda_i = \mu_i/\Sigma\mu$. This is equivalent to testing
whether the frequencies could have arisen from a multinomial population
with proportions $\lambda_1, \cdots, \lambda_k$, so that the $\chi^2$ with $(k - 1)$ degrees of
freedom is

$$\chi^2 = \sum \frac{(X_i - \lambda_i\Sigma X)^2}{\lambda_i\Sigma X}$$


Example 1. Four samples of sizes 120, 100, 100, 125 from Poisson
populations gave the following mean values: 251/120, 323/100, 180/100,
426/125. Do the populations have the same mean values?

It is seen in 2a.3 that the sum of n independent Poisson variates is
also a Poisson variate with mean value equal to n times the mean of
the original population. Considering the sums observed and assuming
equal mean $\mu$ for all the four populations, we have the following.

                                                      Total
    Sum          251       323       180       426     1180
    Expected     120μ      100μ      100μ      125μ    445μ

The test for equality of mean values is equivalent to testing whether
the sums are in the ratio 120:100:100:125. The expected distribution
of 1180 is

    318.20    265.17    265.17    331.46

so that $\chi^2$ = 81.12, which is significant for 3 degrees of freedom. The
mean values cannot be considered to be the same for all populations.
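
The arithmetic of this example is short enough to verify directly. The Python sketch below is an added illustration; the function name is an assumption. It distributes the grand total in proportion to the sample sizes and forms the $\chi^2$ of the observed sums about these expectations.

```python
def poisson_sums_chi_square(sums, sizes):
    """Chi-square for equality of Poisson means from sample sums and sizes."""
    grand_total = sum(sums)
    total_size = sum(sizes)
    # Under equal means the sums are expected in the ratio of the sample sizes.
    expected = [grand_total * size / total_size for size in sizes]
    chi2 = sum((s - e) ** 2 / e for s, e in zip(sums, expected))
    return chi2, expected


chi2, expected = poisson_sums_chi_square([251, 323, 180, 426],
                                         [120, 100, 100, 125])
print([round(e, 2) for e in expected], round(chi2, 2))
```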
Example 2. Obtain the distribution of entries in a two-way table for
fixed marginals if the observation in the (i, j)th cell is regarded as a Poisson
variate with mean value $\tau_i\beta_j$. Assume that $\sum\tau_i = \tau$ and $\sum\beta_j = \beta$.
The joint distribution of $n_{ij}$, $(i = 1, \cdots, r;\ j = 1, \cdots, s)$, is

$$e^{-\tau\beta}\prod\prod \frac{(\tau_i\beta_j)^{n_{ij}}}{n_{ij}!} = e^{-\tau\beta}\frac{(\tau\beta)^{n_{\cdot\cdot}}}{n_{\cdot\cdot}!} \times \frac{n_{\cdot\cdot}!}{\prod n_{i\cdot}!}\prod\left(\frac{\tau_i}{\tau}\right)^{n_{i\cdot}} \times \frac{n_{\cdot\cdot}!}{\prod n_{\cdot j}!}\prod\left(\frac{\beta_j}{\beta}\right)^{n_{\cdot j}} \times \frac{\prod n_{i\cdot}!\,\prod n_{\cdot j}!}{n_{\cdot\cdot}!\,\prod\prod n_{ij}!}$$

The last expression is the desired probability, the first three representing
the joint probability of the observed marginals. The relative probability
is the same as in a contingency table with two independent attributes
and fixed marginals. In any problem a test of independence can be
carried out to test the hypothesis that the mean of the Poisson population
for the (i, j)th cell can be written as the product of two parameters
specific for the ith row and the jth column. If this is true, then the
marginal totals can be used to test whether $\tau_i$ are identical or $\beta_j$ are
identical.

In an analysis of randomized blocks, when the plot yields can be
considered to be Poisson variates it seems reasonable to set up the
product hypothesis $\tau_i\beta_j$, $\tau_i$ representing the treatment effect, and $\beta_j$ the
block effect. The adequacy of the product hypothesis can be tested
before testing for treatment differences.


5e Transformation of Statistics 


5e.1 A General Lemma 

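Let the joint distribution of the statistics $T_1, \cdots, T_k$ tend to the
k-variate normal form with mean values $\theta_1, \cdots, \theta_k$ and dispersion
matrix $n^{-1}(\sigma_{ij})$, where $\sigma_{ij}$ are finite and n is the sample size. This
means that the variables $\sqrt{n}(T_1 - \theta_1), \cdots, \sqrt{n}(T_k - \theta_k)$ are in the
limit distributed as a k-variate normal distribution with zero mean
values and dispersion matrix $(\sigma_{ij})$.

Lemma: If $f(T_1, \cdots, T_k)$ is a continuous function with continuous
first partial derivatives then the variable

$$u = \sqrt{n}\,[f(T_1, \cdots, T_k) - f(\theta_1, \cdots, \theta_k)]$$

is distributed normally in the limit with zero mean and variance

$$\sum\sum \sigma_{ij}\,\frac{\partial f}{\partial\theta_i}\frac{\partial f}{\partial\theta_j}$$

Since $f(T_1, \cdots, T_k)$ has continuous partial derivatives in the neighborhood
of $\theta_1, \cdots, \theta_k$, expanding by the mean value theorem we get

$$f(T_1, \cdots, T_k) = f(\theta_1, \cdots, \theta_k) + \sum (T_i - \theta_i)\left(\frac{\partial f}{\partial\theta_i} + \eta_i\right)$$

where $\eta_i \to 0$ as $T_i \to \theta_i$. Now

$$u - \sqrt{n}\sum (T_i - \theta_i)\frac{\partial f}{\partial\theta_i} = \sqrt{n}\sum (T_i - \theta_i)\,\eta_i$$

so that u and $\sqrt{n}\sum (T_i - \theta_i)(\partial f/\partial\theta_i)$ have the same limiting distribution
if $\sqrt{n}\sum (T_i - \theta_i)\eta_i \to 0$ stochastically. To prove this it is enough to
show that $\sqrt{n}(T_i - \theta_i)\eta_i \to 0$ stochastically for all i.

$$\begin{aligned}
P\{|\sqrt{n}(T_i - \theta_i)\eta_i| < \epsilon\}
&\ge P\{|\sqrt{n}(T_i - \theta_i)\eta_i| < \epsilon \text{ and } |\eta_i| < \epsilon'\} \\
&\ge P\{|\sqrt{n}(T_i - \theta_i)| < \epsilon/\epsilon' \text{ and } |\eta_i| < \epsilon'\} \\
&\ge P\{|\sqrt{n}(T_i - \theta_i)| < \epsilon/\epsilon'\} - P\{|\eta_i| > \epsilon'\}
\end{aligned}$$

Since $\sqrt{n}(T_i - \theta_i)$ is normally distributed in the limit, $P\{|\sqrt{n}(T_i -
\theta_i)| < A\}$ can be made greater than $(1 - \delta')$ for any $\delta'$ by choosing A
and n sufficiently large. Also since $T_i \to \theta_i$ stochastically and $\eta_i \to 0$
as $T_i \to \theta_i$ mathematically, it follows that $P\{|\eta_i| > \epsilon'\} < \delta''$ for
large n, however small $\epsilon'$ may be. If now $\epsilon'$ is chosen such that $\epsilon/\epsilon' =
A$, then

$$P\{|\sqrt{n}(T_i - \theta_i)| < \epsilon/\epsilon'\} - P\{|\eta_i| > \epsilon'\} > 1 - \delta' - \delta'' > 1 - \delta$$

since $\delta'$ and $\delta''$ are arbitrary. This proves the required result. The
statistic $\sum (T_i - \theta_i)(\partial f/\partial\theta_i)$, being a linear function of statistics which
tend to be normally distributed, is itself normally distributed in the
limit with

$$E\left\{\sum (T_i - \theta_i)\frac{\partial f}{\partial\theta_i}\right\} = \sum E(T_i - \theta_i)\frac{\partial f}{\partial\theta_i} = 0$$

and

$$V\left\{\sum (T_i - \theta_i)\frac{\partial f}{\partial\theta_i}\right\} = \frac{1}{n}\sum\sum \sigma_{ij}\,\frac{\partial f}{\partial\theta_i}\frac{\partial f}{\partial\theta_j}$$

Therefore $f(T_1, \cdots, T_k)$ has the asymptotic mean and variance

$$f(\theta_1, \cdots, \theta_k) \qquad \text{and} \qquad \frac{1}{n}\sum\sum \sigma_{ij}\,\frac{\partial f}{\partial\theta_i}\frac{\partial f}{\partial\theta_j} \tag{5e.1.1}$$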


As a particular case of this lemma it follows that, if T is asymptotically
distributed normally about $\theta$, then any function F(T) of T is asymptotically
normally distributed about $F(\theta)$ with variance $(dF/d\theta)^2\,\psi(\theta)$
where $\psi(\theta)$ is the variance of T, provided that dF/dT is continuous in
the neighborhood of $\theta$.

If T is an efficient statistic, then $1/\psi(\theta) = I(\theta)$, where I is the information,
in which case F(T) as an estimate of $F(\theta)$ has the asymptotic
variance $\{F'(\theta)\}^2/I$, which is the minimum attainable. Therefore F(T)
is efficient as an estimate of $F(\theta)$.


In some cases $\psi(\theta)$, the variance of T, may be independent of $\theta$.
Otherwise it may be necessary to transform the statistic T such that
the new statistic has an asymptotic variance independent of $\theta$. Let
F(T) be the transformation needed; then

$$V\{F(T)\} = \{F'(\theta)\}^2\,\psi(\theta)$$

On equating this to a constant, say $c^2$, the following differential equation is
obtained.

$$\frac{dF}{d\theta} = \frac{c}{\sqrt{\psi(\theta)}}, \qquad F = \int \frac{c\,d\theta}{\sqrt{\psi(\theta)}}$$

This result is applied in deriving the following transformations. 


5e.2 The Square Root Transformation of the Poisson Variate 
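If x is a Poisson variate, then

$$E(x) = \mu, \qquad V(x) = \mu$$

The functional form of the transformation is supplied by

$$F(\mu) = \int \frac{c\,d\mu}{\sqrt{\mu}} = \sqrt{\mu}$$

by choosing c suitably. The transformed variable $\sqrt{x}$ has the asymptotic
mean and variance

$$\sqrt{\mu} \qquad \text{and} \qquad \tfrac{1}{4}$$

when $\mu$ is large. It was found by Anscombe (1948) that the transformation
$\sqrt{x + b}$ where b is a suitably determined constant has some theoretical
advantages. Let $(x - \mu) = t$ and $(\mu + b) = \mu'$; then by using
Taylor's expansion we find

$$\sqrt{x + b} = \sqrt{\mu'}\left\{1 + \frac{1}{2}\,\frac{t}{\mu'} - \frac{1}{8}\left(\frac{t}{\mu'}\right)^2 + \cdots + a_s\left(\frac{t}{\mu'}\right)^s + \cdots\right\}$$

where

$$a_s = (-1)^{s-1}\,\frac{(2s - 2)!}{2^{2s-1}\,(s - 1)!\,s!}$$

Observing that the Poisson moments are

$$E(t) = 0, \qquad E(t^2) = \mu, \qquad E(t^3) = \mu, \qquad E(t^4) = 3\mu^2 + \mu$$

we find, by taking expectations of both sides of the expansion

$$E\sqrt{x + b} = \sqrt{\mu + b} - \frac{1}{8\mu^{1/2}} + \frac{24b - 7}{128\mu^{3/2}} + \cdots$$

$$V(\sqrt{x + b}) = \frac{1}{4}\left\{1 + \frac{3 - 8b}{8\mu} + \frac{32b^2 - 52b + 17}{32\mu^2} + \cdots\right\}$$

which, on choosing the value $b = \tfrac{3}{8}$, reduce to

$$E\sqrt{x + \tfrac{3}{8}} = \sqrt{\mu + \tfrac{3}{8}} - \frac{1}{8\mu^{1/2}} + \frac{1}{64\mu^{3/2}} + \cdots$$

$$V\left(\sqrt{x + \tfrac{3}{8}}\right) = \frac{1}{4}\left(1 + \frac{1}{16\mu^2} + \cdots\right)$$

The variance of $\sqrt{x + \tfrac{3}{8}}$ is more stable than that of $\sqrt{x}$ because the
second term in the expansion of the variance of $\sqrt{x + \tfrac{3}{8}}$ is $O(1/\mu^2)$.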
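
The stabilizing effect of the constant 3/8 can be seen by computing the variance of $\sqrt{x + b}$ exactly from the Poisson probabilities rather than from the expansion. The Python sketch below is an added illustration; the function name, the truncation of the sum at 400 terms, and the choice of $\mu = 20$ are assumptions of the sketch.

```python
from math import exp, sqrt


def sqrt_transform_variance(mu, b, terms=400):
    """Variance of sqrt(x + b) for a Poisson variate x with mean mu.

    The sum over x is truncated after `terms` values, which is ample for
    moderate mu since the neglected probabilities are negligible.
    """
    p = exp(-mu)                 # P(x = 0)
    first = second = 0.0
    for k in range(terms):
        y = sqrt(k + b)
        first += p * y           # accumulates E sqrt(x + b)
        second += p * y * y      # accumulates E (x + b)
        p *= mu / (k + 1)        # P(x = k + 1) from P(x = k)
    return second - first * first


v_plain = sqrt_transform_variance(20.0, 0.0)       # sqrt(x)
v_refined = sqrt_transform_variance(20.0, 0.375)   # sqrt(x + 3/8)
print(round(v_plain, 4), round(v_refined, 4))
```

For a mean of this size the variance of $\sqrt{x + 3/8}$ lies much closer to 1/4 than that of $\sqrt{x}$, in line with the leading terms of the two expansions.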


5e.3 The Sin⁻¹ Transformation of the Binomial Proportion


The binomial proportion r/n has the mean value $\pi$ and variance
$\pi(1 - \pi)/n$. The transformation is obtained by solving the equation

$$F(\pi) = \int \frac{c\sqrt{n}\,d\pi}{\sqrt{\pi(1 - \pi)}} = \sin^{-1}\sqrt{\pi} \qquad \text{choosing } c \text{ suitably}$$

$$V\left(\sin^{-1}\sqrt{\frac{r}{n}}\right) = \frac{1}{4n} \qquad \text{approximately}$$

It is shown by Anscombe (1948) that a slightly better transformation is

$$\sin^{-1}\sqrt{\frac{r + \tfrac{3}{8}}{n + \tfrac{3}{4}}}$$

which has the asymptotic variance 1/(4n + 2). If n is large, the simpler
transformation $\sin^{-1}\sqrt{r/n}$ can be used; for moderately large
sample sizes the refined transformation $\sin^{-1}\sqrt{(r + \tfrac{3}{8})/(n + \tfrac{3}{4})}$ may
be used.
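
Both transformations are one-line computations. The Python sketch below is illustrative only; the function names are assumptions, and the figures 12 out of 194 are simply a convenient numerical input. The angle is converted to degrees, in which unit the approximate variance 1/(4n) becomes $(180/\pi)^2/(4n) = 8100/(n\pi^2)$.

```python
from math import asin, degrees, pi, sqrt


def angular(r, n):
    """Simple transformation sin^-1 sqrt(r/n), in radians."""
    return asin(sqrt(r / n))


def angular_refined(r, n):
    """Refined transformation sin^-1 sqrt((r + 3/8)/(n + 3/4)), in radians."""
    return asin(sqrt((r + 0.375) / (n + 0.75)))


def variance_in_degrees(n):
    """Approximate variance 1/(4n) of the simple transform, in degrees squared."""
    return (180.0 / pi) ** 2 / (4.0 * n)


phi = degrees(angular(12, 194))
phi_refined = degrees(angular_refined(12, 194))
print(round(phi, 1), round(phi_refined, 1), round(variance_in_degrees(194), 3))
```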




Example. R. A. Fisher (1949) found the following recombination
fractions between undulated and agouti loci in house mice. The data
relate to backcrosses so that the estimate of the fraction is the ratio
of recombinants to the total offspring.


TABLE 5e.3a. Recombination Fractions Observed for Twenty 
Classes of Heterozygous Parents * 


afat «Ya Afa Ava ala 
2 ae 5 12 10 2 
Tor Tis rss 7 TS 
T Iis ors eo tS ch 
Alad Ala Ad Aa a'a Total 
9 sas zb oper orm wes) rir 
* The number in the denominator gives the number of animals contributing to the 
ratio. 
Heterozygote AYA” 
        Recombinations    Old Combinations    Total
♀             12                182            194
♂              9                119            128


The data for any heterozygote supply a test with 1 degree of freedom
for sex difference in the recombination fraction. The (independence)
$\chi^2$ for sex difference is 0.0905. Similarly, we obtain the individual $\chi^2$
for each of the ten types of heterozygotes and obtain the sum 4.827,
which has nearly 90% probability for 10 degrees of freedom, indicating
no sex difference. If we ignore sex differences, then on pooling the data
over sex we obtain a contingency table for heterozygotes and the nature
of combinations (old or new). This supplies a $\chi^2$ = 16.315 with 9 degrees
of freedom showing significant differences in the recombination fractions
among the heterozygotes.
[...] suitable for further analysis on sex differences.
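The 1-degree-of-freedom $\chi^2$ for a single heterozygote can be reproduced from the four counts given above. This is a small sketch using the usual shortcut formula for a 2 x 2 table without continuity correction; the function name is an assumption.

```python
def chi_square_2x2(a, b, c, d):
    """Independence chi-square (1 d.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# females: 12 recombinations, 182 old combinations
# males:    9 recombinations, 119 old combinations
chi2 = chi_square_2x2(12, 182, 9, 119)
print(round(chi2, 4))
```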
Suppose that it is found that sex differences exist in all the ten types
of mating. Then the further problem arises as to whether the sex
difference is the same in all the cases. That is, we need to test whether
there is interaction between sex and the nature of the heterozygote.
This can be done by applying the angular transformation to each
observed proportion $p$ and then applying the analysis of variance.
If $\phi$ is expressed in degrees such that $p = \sin^2\phi$, as in Fisher and
Yates tables, then $\phi$ has the variance $8100/n\pi^2$ or approximately
$820.7/n$. The 20 angles and the necessary computational steps are as
given below.


                         A^Y A^L     A^Y A    A^Y a^t     A^Y a     A^L A
φ (♀)                      14.4       11.8      13.0       15.1       9.2
φ (♂)                      15.3       15.4      11.2       14.9      10.4
d = difference             -0.9       -3.6       1.8        0.2      -1.2
w = n₁n₂/(n₁ + n₂)        77.12      60.93     95.19      91.20     57.16
dw                       -69.41    -219.35    171.34      18.24    -68.59
d²w = d(dw)               62.47     789.65    308.41       3.65     82.31

                         A^L a^t     A^L a    A a^t       A a      a^t a      Total
φ (♀)                      12.7        7.9      11.9       12.2      15.2
φ (♂)                      10.4        9.6      14.2        7.9      13.5
d                           2.3       -1.7      -2.3        4.3       1.7
w                         98.14      85.42    112.16      83.98     95.32     856.62
dw                       225.72    -145.21   -257.97     361.11    162.04     177.92
d²w                      519.16     246.86    593.33    1552.79    275.47    4434.10


The $\chi^2$ with 10 degrees of freedom for testing sex differences in all the
ten types is

$$4434.10 \div 820.7 = 5.40$$

This is slightly above the value 4.827 obtained earlier by a direct $\chi^2$
analysis. From the 10 degrees of freedom $\chi^2$ we subtract the $\chi^2$ with
1 degree

$$\frac{(\Sigma dw)^2}{\Sigma w} \div 820.7 = (177.92)^2 \div (856.62)(820.7) = 0.05$$

due to overall sex difference. The residual

$$5.40 - 0.05 = 5.35$$

is the $\chi^2$ with 9 degrees of freedom for testing whether sex difference
depends on the type of the heterozygote. This interaction $\chi^2$ is not
significant, nor is that due to sex difference. In such a case, the differences
in the various types of heterozygotes can be studied by summing
over sex. To complete the analysis, however, we determine the $\chi^2$ due
to differences in heterozygotes, eliminating sex difference.
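The three $\chi^2$ values just obtained follow mechanically from the rows $d$ and $w$ of the table of angles. The sketch below repeats the arithmetic; the list and variable names are illustrative, and the constant 820.7 is the variance factor quoted in the text.

```python
UNIT_VARIANCE = 820.7   # n times the variance of an angle in degrees

# differences of angles (female minus male) and weights n1*n2/(n1 + n2)
d = [-0.9, -3.6, 1.8, 0.2, -1.2, 2.3, -1.7, -2.3, 4.3, 1.7]
w = [77.12, 60.93, 95.19, 91.20, 57.16, 98.14, 85.42, 112.16, 83.98, 95.32]

sum_w = sum(w)
sum_dw = sum(di * wi for di, wi in zip(d, w))
sum_d2w = sum(di * di * wi for di, wi in zip(d, w))

chi2_all = sum_d2w / UNIT_VARIANCE                 # 10 d.f.
chi2_sex = sum_dw ** 2 / sum_w / UNIT_VARIANCE     # 1 d.f.
chi2_interaction = chi2_all - chi2_sex             # 9 d.f.

print(round(chi2_all, 2), round(chi2_sex, 2), round(chi2_interaction, 2))
```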

The total $\chi^2$ with (20 - 1) = 19 degrees of freedom is

$$\left[\Sigma n\phi^2 - \frac{(\Sigma n\phi)^2}{\Sigma n}\right] \div 820.7$$

the summation extending over the 20 angles. The following computations
are needed.

             Σn        Σnφ          Σnφ²       Σnφ² - (Σnφ)²/Σn = 820.7χ²      χ²
♀           1731     21,470.9    274,666.21            8,346.43              10.17
♂           1843     22,699.4    290,576.08           10,997.81              13.40

Overall     3574     44,170.3    565,242.29           19,351.02              23.58


The total of $\chi^2$ for ♀ and ♂

$$10.17 + 13.40 = 23.57$$

has 18 degrees of freedom. Subtracting the interaction component of
5.35 on 9 degrees of freedom, the residual $\chi^2$ with 9 degrees of freedom
for testing differences in heterozygotes eliminating sex is

$$23.57 - 5.35 = 18.22$$

which is significant. This can also be calculated in a slightly different
way. The sex $\chi^2$, ignoring differences in heterozygotes, is obtained as
follows.

$$820.7\,\chi^2 = \frac{(21{,}470.9)^2}{1731} + \frac{(22{,}699.4)^2}{1843} - \frac{(44{,}170.3)^2}{3574}$$

$$\chi^2 = 0.01$$

the values being obtained from $\Sigma n\phi$ and $\Sigma n$ over all heterozygotes.
The valid $\chi^2$ for heterozygotes is obtained by subtracting from the total
the above value and the interaction sum of squares. This leads to the
value, 23.58 - 0.01 - 5.35 = 18.22, the same as before.
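The same partition can be rebuilt from the three sums per sex. The following sketch takes the sums $\Sigma n$, $\Sigma n\phi$, $\Sigma n\phi^2$ and the interaction value 5.35 from the text; everything else (names, layout) is illustrative.

```python
UNIT_VARIANCE = 820.7
INTERACTION = 5.35      # 9 d.f., taken from the analysis of differences

# (sum n, sum n*phi, sum n*phi^2) for each sex
sums = {
    "female": (1731, 21470.9, 274666.21),
    "male": (1843, 22699.4, 290576.08),
}

def corrected_ss(n, s1, s2):
    """Weighted sum of squares of the angles about their weighted mean."""
    return s2 - s1 * s1 / n

chi2_within = {sex: corrected_ss(*v) / UNIT_VARIANCE for sex, v in sums.items()}

n_all = sum(v[0] for v in sums.values())
s1_all = sum(v[1] for v in sums.values())
s2_all = sum(v[2] for v in sums.values())
chi2_total = corrected_ss(n_all, s1_all, s2_all) / UNIT_VARIANCE   # 19 d.f.

# sex chi-square ignoring heterozygotes (1 d.f.)
between = sum(v[1] ** 2 / v[0] for v in sums.values()) - s1_all ** 2 / n_all
chi2_sex_ignoring = between / UNIT_VARIANCE

chi2_heterozygotes = chi2_total - chi2_sex_ignoring - INTERACTION  # 9 d.f.
print(round(chi2_total, 2), round(chi2_sex_ignoring, 2), round(chi2_heterozygotes, 2))
```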


TABLE 5e.3β. Analysis of χ²

                   Degrees of
                    Freedom        χ²
Sex                    1          0.05
Heterozygotes          9         18.22
Interaction            9          5.35
Total                 19         23.58


The total $\chi^2$ with 19 degrees of freedom is significant, showing overall
differences. The various components of $\chi^2$ do not add up to the total
because the proportions are based on different numbers. If the interaction
component is high, then a precise test will be to calculate the
mean $\chi^2$ for heterozygotes and interaction

$$\frac{18.22}{9} = 2.024 \qquad \frac{5.35}{9} = 0.594$$

and determine the variance ratio

$$\frac{2.024}{0.594} = 3.41$$

with 9 and 9 degrees of freedom. This is significant on the 5% level.


5e.4 Other Useful Transformations 
The estimated variance

$$s^2 = \frac{\Sigma(x_i - \bar{x})^2}{n - 1}$$

where $x_1, \cdots, x_n$ are $n$ random observations from a normal population,
has the expected value $\sigma^2$ and variance $2\sigma^4/(n - 1)$. The transformation
needed to make the variance independent of $\sigma$ is

$$F(\sigma^2) = \int \frac{c\sqrt{n - 1}\,d\sigma^2}{\sqrt{2}\,\sigma^2} = \log_e \sigma^2 \quad \text{choosing } c \text{ suitably}$$

$$V(\log_e s^2) = \frac{2}{n - 1} \quad \text{approximately}$$

The estimated correlation from $n$ pairs of observations from a bivariate
normal distribution has the asymptotic variance

$$\frac{(1 - \rho^2)^2}{n - 1}$$

The necessary transformation is

$$F(\rho) = \int \frac{c\sqrt{n - 1}\,d\rho}{1 - \rho^2} = \tanh^{-1}\rho \quad \text{choosing } c \text{ suitably}$$

$$V(\tanh^{-1} r) \approx \frac{1}{n - 1}$$

The uses of these transformed statistics with slight refinements will be
discussed in Chapter 6.




5f Large Sample Standard Errors of Moments 


5f.1 Variances and Covariances of Raw Moment Statistics 

Any population may be defined with respect to a number of frequency
classes with probabilities $\pi_1, \cdots, \pi_k$ and the corresponding values of a
variate $x_1, \cdots, x_k$. A continuous distribution can be considered to
contain an infinite number of class intervals, each of length $dx$, the
differential element. In such a case the raw moments in the population
are

$$\nu_r = \pi_1 x_1^r + \cdots + \pi_k x_k^r \qquad r = 1, 2, \cdots$$

If $n_1, n_2, \cdots, n_k$, $(\Sigma n_i = n)$, are the observed frequencies in the $k$
classes, then the $r$th sample moment about the origin is

$$O_r = \frac{n_1 x_1^r + \cdots + n_k x_k^r}{n}$$

which is a linear function of the frequencies. If $O_s$ is the $s$th sample
moment, then using the results (2a.9.1) we find

$$E(O_r) = \Sigma \pi_i x_i^r = \nu_r$$

$$V(O_r) = \frac{1}{n}\left\{\Sigma x_i^{2r}\pi_i - (\Sigma x_i^r \pi_i)^2\right\} = \frac{\nu_{2r} - \nu_r^2}{n}$$

$$\text{cov}\,(O_r, O_s) = \frac{1}{n}\left\{\Sigma x_i^{r+s}\pi_i - (\Sigma x_i^r \pi_i)(\Sigma x_i^s \pi_i)\right\} = \frac{\nu_{r+s} - \nu_r \nu_s}{n}$$

If the origin is the population mean, then in the above expressions the
moments about an arbitrary origin will be replaced by $\mu_r$, the population
$r$th moment about the mean.


The $r$th moment about the sample mean is

$$m_r = O_r - \binom{r}{1} O_{r-1} O_1 + \binom{r}{2} O_{r-2} O_1^2 - \cdots + (-1)^r O_1^r$$

Since this is invariant for origin, we can consider $O_r$ to be the sample
raw moments about the population mean as the origin. We now use
formula (5e.1.1) for determining the asymptotic variance of $m_r$.

$$\frac{\partial m_r}{\partial O_r} = 1 \qquad \frac{\partial m_r}{\partial O_i} = (-1)^{r-i}\binom{r}{i} O_1^{r-i} \quad (1 < i < r)$$

$$\frac{\partial m_r}{\partial O_1} = -rO_{r-1} + \binom{r}{2} O_{r-2}(2O_1) - \cdots$$

Using the approximation

$$m_r - \mu_r \approx (O_r - \mu_r) - r\mu_{r-1}(O_1 - \mu_1)$$

since all the other derivatives vanish at the expected values, we find

$$V(m_r) = E(m_r - \mu_r)^2 \approx V(O_r) - 2r\mu_{r-1}\,\text{cov}\,(O_r, O_1) + r^2\mu_{r-1}^2 V(O_1)$$

$$= \frac{1}{n}\left\{\mu_{2r} - \mu_r^2 - 2r\mu_{r-1}\mu_{r+1} + r^2\mu_{r-1}^2\mu_2\right\}$$

Similarly

$$\text{cov}\,(m_r, m_s) = \frac{1}{n}\left\{\mu_{r+s} - \mu_r\mu_s + rs\,\mu_2\mu_{r-1}\mu_{s-1} - r\mu_{r-1}\mu_{s+1} - s\mu_{r+1}\mu_{s-1}\right\}$$

As special cases we obtain the large sample variances:

$$V(m_2) = \frac{1}{n}(\mu_4 - \mu_2^2)$$

$$V(m_3) = \frac{1}{n}(\mu_6 - \mu_3^2 - 6\mu_2\mu_4 + 9\mu_2^3)$$

$$V(m_4) = \frac{1}{n}(\mu_8 - \mu_4^2 - 8\mu_3\mu_5 + 16\mu_2\mu_3^2)$$

$$V(O_1, \text{ about the origin}) = \frac{\mu_2}{n}$$

the last formula being exact.
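The general formula for $V(m_r)$ can be evaluated for any population whose central moments are known. The sketch below is an illustration under the assumption of a normal population, for which the even central moments are $1 \cdot 3 \cdots (k-1)\sigma^k$ and the odd ones vanish; the function names are hypothetical.

```python
def normal_central_moment(k, sigma=1.0):
    """k-th central moment of a normal population (mu_0 = 1, odd moments 0)."""
    if k % 2 == 1:
        return 0.0
    product = 1.0
    for j in range(1, k, 2):       # 1 * 3 * 5 * ... * (k - 1)
        product *= j
    return product * sigma ** k

def large_sample_var_m(r, mu, n):
    """Large sample variance of the r-th moment about the sample mean.

    mu is a function returning the population central moment of a given
    order; the expression is the general one derived above.
    """
    return (mu(2 * r) - mu(r) ** 2
            - 2 * r * mu(r - 1) * mu(r + 1)
            + r * r * mu(r - 1) ** 2 * mu(2)) / n

n = 100
for r in (2, 3, 4):
    print(r, large_sample_var_m(r, normal_central_moment, n))
```

With unit variance the three values reduce to 2/n, 6/n and 96/n, which agree with the special cases listed above when the normal moments are substituted.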
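Example. The variance of the coefficient of variation $100\sqrt{m_2}/O_1$,
where $O_1$ is the average about the origin, is

$$\frac{\gamma^2}{n}\left(\frac{\mu_4 - \mu_2^2}{4\mu_2^2} + \frac{\mu_2}{\mu_1'^2} - \frac{\mu_3}{\mu_2\mu_1'}\right)$$

where $\gamma$ is the population value $100\sqrt{\mu_2}/\mu_1'$.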




5f.2 Large Sample Tests of Difference between Means and
an Illustration of the P_λ Test

Let $\bar{x}_1$, $\bar{x}_2$ be the mean values and $s_1^2$, $s_2^2$ the estimated variances in
two samples of sizes $n_1$ and $n_2$, respectively. The standard errors of
$\bar{x}_1$ and $\bar{x}_2$ are $\sqrt{s_1^2/n_1}$ and $\sqrt{s_2^2/n_2}$, and hence that of $\bar{x}_1 - \bar{x}_2$ is
$\sqrt{s_1^2/n_1 + s_2^2/n_2}$. If $n_1$ and $n_2$ are large, the statistic

$$w = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

can be used as a normal deviate with zero mean and unit standard
deviation to test the hypothesis that the samples are drawn from populations
having the same mean, nothing being specified about the variations.

It is not necessary that the original populations be normal, provided
that the samples are large enough. How large the sample should be
depends on the nature of the populations. Populations with highly
skew or multimodal distributions require very large samples.

In a feeding experiment with pasteurized and unpasteurized milk,
Elderton (1933) found the following values of $w$ and the associated
normal probabilities. The samples were large so that the difference in
mean statures divided by the standard error of difference could be used
as a normal deviate.


TABLE 5f.2α. Values of w and Probabilities for Each Age Group

              Observed       Probability
   Age         w = w₀         P(w > w₀)        log₁₀ P

   6¾           2.69          0.0035726       3̄.5529844
   7¾          -0.71          0.7611479       1̄.8814690
   8¾           1.24          0.1074877       1̄.0313588
   9¾           1.84          0.0328841       2̄.5169849
  10¾           1.06          0.1445723       1̄.1600851

                                     Total    6̄.1428822
                                            = -5.8571178


It is found that the probabilities are less than 5% in two cases.
Should we then say that pasteurized milk is beneficial for the age
groups 6¾ and 9¾ and not for the other age groups? The positive
deviations (values of $w$) in 4 out of 5 cases indicate a taller stature for boys fed
on pasteurized milk, and the nonsignificance, in some cases, may be due
to the inadequacy of the numbers or to the presence of high variance
in the stature of boys chosen for the experiment. On the other hand,
it may be argued that 2 significant cases out of 5 could arise by chance,
even when pasteurized milk does not affect the stature of boys. In
such cases the evidences supplied by the various groups have to be
combined to answer the problem of differences caused by feed. This
can be done by calculating the statistic

$$P_\lambda = -2\,\Sigma \log_e P_i = -2 \log_e 10 \sum_{i=1}^{5} \log_{10} P_i = -2(2.3026)(-5.85712) = 26.9732$$

which is distributed as $\chi^2$ with 5 × 2 degrees of freedom, as shown in
2a.8. In the above problem $\chi^2$ = 26.9732, which as $\chi^2$ with 10 degrees
of freedom is significant on the 1% level. This shows that pasteurized
milk causes some difference in stature in general. It is difficult to say
from these data alone that in any particular age group pasteurized milk
has no effect.
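The two steps of this section, forming the deviate $w$ and then combining the probabilities, can be sketched as follows. The function names are assumptions; Elderton's means and variances are not reproduced in the text, so the five observed deviates are entered directly. The upper tail of $\chi^2$ is evaluated with the finite series that holds when the degrees of freedom are even.

```python
import math

def two_sample_w(mean1, var1, n1, mean2, var2, n2):
    """Large sample normal deviate for the difference between two means."""
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)

def normal_upper_tail(w):
    """P(W > w) for a standard normal deviate."""
    return 0.5 * math.erfc(w / math.sqrt(2.0))

def chi2_upper_tail_even(x, dof):
    """P(chi-square > x) when dof is even: exp(-x/2) * sum (x/2)^j / j!."""
    term = 1.0
    total = 1.0
    for j in range(1, dof // 2):
        term *= (x / 2.0) / j
        total += term
    return math.exp(-x / 2.0) * total

w_values = [2.69, -0.71, 1.24, 1.84, 1.06]       # observed deviates by age group
p_values = [normal_upper_tail(w) for w in w_values]
p_lambda = -2.0 * sum(math.log(p) for p in p_values)
dof = 2 * len(p_values)
tail = chi2_upper_tail_even(p_lambda, dof)
print(round(p_lambda, 2), dof, tail < 0.01)
```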




5f.3 Tests of Normality 


Biometric measurements relating to homogeneous groups usually have
unimodal distributions in which the asymmetry is small and the kurtosis
is approximately the same as that of the normal distribution. Asymmetry
is measured by $\sqrt{\beta_1}$ where

$$\beta_1 = \frac{m_3^2}{m_2^3}$$

$m_2$ and $m_3$ being the second and third sample moments. Deviations
from normal kurtosis are measured by $(\beta_2 - 3)$, where

$$\beta_2 = \frac{m_4}{m_2^2}$$

$m_4$ being the fourth moment. Deviations from these normal values,
when significant, are important features of distributions of biometric
measurements.

If the observed frequency distribution is given, we can fit a normal
curve to it and test for the goodness of fit. This is useful in testing
for all departures of the observed frequency distribution from the expected
on the normal basis, but it is quite insensitive in testing for
some specific aspects of the distribution, such as symmetry and kurtosis.
If deviations other than those in symmetry and normal kurtosis are
not important, then in any problem the observed $\beta_1$ and $\beta_2$ can be
tested for the expected values 0 and 3. If $n$ is the sample size, the
asymmetry can be tested by using the statistic

$$\sqrt{\beta_1}\left[\frac{(n + 1)(n + 3)}{6(n - 2)}\right]^{1/2} \qquad (5f.3.1)$$

as a normal deviate with zero mean and unit variance when $n$ is large.
The sign in the above statistic is the same as that of $m_3$, the third
moment. In the same way, the probability of departure from normality
in kurtosis can be tested by using the normal deviate

$$\left(\beta_2 - 3 + \frac{6}{n + 1}\right)\left[\frac{(n + 1)^2(n + 3)(n + 5)}{24n(n - 2)(n - 3)}\right]^{1/2} \qquad (5f.3.2)$$

The following table gives for 23 samples the values of $n$, $\sqrt{\beta_1}$, and
$P(\sqrt{\beta_1})$, the probability of $\sqrt{\beta_1}$'s being smaller than the observed on
the normal hypothesis.


TABLE 5f.3α. The Values of √β₁ and P(√β₁) for Nasal Height Distributions of 23
Castes and Tribes of the United Provinces

Sample     n      √β₁      P(√β₁)       Sample     n      √β₁      P(√β₁)

   1      86     0.1552     0.73          13       57   -0.5229     0.05*
   2      91     0.1025     0.66          14      191    0.0600     0.63
   3     107     0.5442     0.01*         15      159   -0.3382     0.04*
   4     139    -0.1910     0.18          16       99   -0.0995     0.34
   5     168    -0.3820     0.02*         17      156    0.1741     0.81
   6     150    -0.0100     0.48          18      157    0.1646     0.80
   7     124    -0.3950     0.04          19      197   -0.1192     0.25
   8     187     0.1000     0.71          20      100   -0.4992     0.02*
   9     113     0.0000     0.50          21      105    0.0693     0.61
  10      63     0.4118     0.92          22      101    0.0877     0.36
  11      94    -0.0970     0.35          23      182   -0.2002     0.13
  12     173     0.0173     0.54


There are about 5 values (marked with an asterisk in the table above)
significant on the 5% level against the expected 23/20 (= 1.15). To
test for overall significance the $P_\lambda$ test explained in 5f.2 may be carried
out. The value of $P_\lambda$ is 69.4880, which as a $\chi^2$ with 2 × 23 = 46 degrees
of freedom is significant, thus indicating skewness of the nasal height
distributions. On the whole the distribution of nasal height is negatively
skew.

The values of $\beta_2$ can be treated in an exactly similar way, using the
formula (5f.3.2).
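The two normal deviates (5f.3.1) and (5f.3.2) translate directly into code. The sketch below is illustrative; the function names are assumptions, and the probability is taken as the lower tail of the normal distribution, as in the table of $\sqrt{\beta_1}$ values.

```python
import math

def skewness_deviate(sqrt_b1, n):
    """Normal deviate (5f.3.1) for an observed sqrt(beta_1) in a sample of size n."""
    return sqrt_b1 * math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)))

def kurtosis_deviate(b2, n):
    """Normal deviate (5f.3.2) for an observed beta_2 in a sample of size n."""
    scale = math.sqrt((n + 1) ** 2 * (n + 3) * (n + 5)
                      / (24.0 * n * (n - 2) * (n - 3)))
    return (b2 - 3.0 + 6.0 / (n + 1)) * scale

def normal_cdf(z):
    """P(Z < z) for a standard normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# sample 1 of the table: n = 86, sqrt(beta_1) = 0.1552
p_sample_1 = normal_cdf(skewness_deviate(0.1552, 86))
# sample 10 of the table: n = 63, sqrt(beta_1) = 0.4118
p_sample_10 = normal_cdf(skewness_deviate(0.4118, 63))
print(round(p_sample_1, 2), round(p_sample_10, 2))
```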




References 


Anscombe, F. J. (1948). The transformation of Poisson, binomial and negative-binomial data. Biom., 35, 246.

Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proc. Roy. Soc. A, 160, 268.

Elderton, E. M. (1933). The Lanarkshire milk experiment. Ann. Eugen. London, 5, 326.

Fisher, R. A. (1949). A preliminary linkage test with agouti and undulated mice. Heredity, 3, 229.

Hotelling, H. (1940). The selection of variates for use in prediction with some comments on the problem of nuisance parameters. Ann. Math. Stats., 11, 271.

Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters with application to problems of estimation. Proc. Camb. Phil. Soc., 44, 50.

Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann. Math. Stats., 20, 595.

Yates, F. (1934). Contingency tables involving small numbers and the χ² test. J. R. S. S. Suppl., 1, 217.


CHAPTER 6 


Tests of Homogeneity of Variances 


and Correlations 


6a Homogeneity of Variances 


6a.1 Test for a Specified Variance 

It is sometimes necessary to test whether an estimated variance is in
agreement with a specified hypothetical variance. If $s^2$ is the estimate
based on $n$ degrees of freedom,* of the hypothetical variance $\sigma^2$, then
the statistic

$$\frac{ns^2}{\sigma^2}$$

can be used as $\chi^2$ with $n$ degrees of freedom to test the above hypothesis.

The variance for head breadth calculated from measurements on 29
crania from Jebel Moya (Sudan) is 48.5632. Could this sample have
arisen from a homogeneous cranial population with a head breadth
variance of 18.2313? The degrees of freedom in this case are 28, one
less than the number of observations, and

$$\chi^2 = \frac{ns^2}{\sigma^2} = 77.2485$$

The object of inquiry is whether the Jebel Moya population belongs to
a homogeneous group. Heterogeneity would increase the internal variance
so that high values of $\chi^2$ would indicate significance. The observed
value exceeds the 5% value of $\chi^2$ with 28 degrees of freedom so that the
hypothesis is rejected. The hypothetical value considered in this case
is the variance for head breadth derived from a large series of crania
from Egypt and Sudan. The higher variance in the present case suggests
that the cranial population of Jebel Moya is heterogeneous.

* The estimate $s^2$ is obtained by dividing the sum of squares of deviations by the
degrees of freedom equal to one less than the sample size if only a single sample is
available.

We could also test by considering the lower tail of the $\chi^2$ distribution
whether an observed variance is significantly less than the hypothetical
variance. Thus, for a hypothetical variance of 90, the $\chi^2$ in the above
problem would be 15.6481, the probability of exceeding which is more
than 95%. The probability of obtaining a $\chi^2$ less than the observed
is therefore less than 5% so that the observed variance is significantly
less than the assigned one.

Suppose that it is not known that the specified value 18.2313 of the
variance is for a homogeneous population, and it is desired to know
whether the observed variance could have reasonably arisen from a
population with a hypothetical variance of 18.2313. In a problem of
this nature we are interested in both small and high values of $\chi^2$ that
might arise, both disproving the null hypothesis. The test procedure
in this case is slightly complicated, and we may consider two situations,
depending on the sample size.

(i) When the sample size is large (greater than 30): When the sample
size is large, $\chi^2$ tends to be normally distributed with degrees of freedom
$n$ as mean and variance $2n$ so that the test reduces to

$$\lambda_1 = \frac{|\chi^2 - n|}{\sqrt{2n}} \geq \lambda$$

where $\lambda$ is the 5% or 1% value of the normal deviate, considering both
tails. The approach to normality of $\chi^2$ is slow so that the above approximation
may not hold good in moderately large samples. A better
approximation is to use $\sqrt{2\chi^2}$ as a normal variate with mean $\sqrt{2n - 1}$
and unit variance, in which case the test is

$$\lambda_2 = \left|\sqrt{2\chi^2} - \sqrt{2n - 1}\right| \geq \lambda$$

A third and a fairly accurate approximation is to use $(\chi^2/n)^{1/3}$ as a
normal variate with mean $(1 - 2/9n)$ and variance $2/9n$ so that the
test is

$$\lambda_3 = \left|\left(\frac{\chi^2}{n}\right)^{1/3} + \frac{2}{9n} - 1\right| \div \sqrt{\frac{2}{9n}} \geq \lambda$$




TABLE 6a.1α. The Admissible Range of the χ² Distribution

 Degrees of           5% Level                  1% Level
  Freedom       Lower χ²    Upper χ²      Lower χ²    Upper χ²
     1          0.003163     7.8155       0.0001341    11.3458
     2          0.08480      9.5282       0.01746      13.2866
     3          0.2961      11.1930       0.1011       15.1251
     4          0.6072      12.8008       0.2640       16.9004
     5          0.9890      14.3700       0.4965       18.6180
     6          1.4250      15.8964       0.7854       20.2980
     7          1.9026      17.3922       1.1214       21.9366
     8          2.4136      18.8616       1.4984       23.5304
     9          2.9529      20.3058       1.9017       25.1352
    10          3.5160      21.7290       2.3450       26.6500
    11          4.0997      23.1341       2.8061       28.1820
    12          4.7004      24.5244       3.2916       29.6808
    13          5.3170      25.9012       3.7960       31.1662
    14          5.9472      27.2650       4.3162       32.6410
    15          6.5910      28.6140       4.8525       34.0995
    16          7.2448      29.9552       5.4048       35.5376
    17          7.9101      31.2851       5.9670       36.9750
    18          8.5842      32.6070       6.5448       38.3886
    19          9.2682      33.9188       7.1307       39.8012
    20          9.9580      35.2260       7.7300       41.1940
    21         10.6554      36.5274       8.3349       42.5880
    22         11.3608      37.8180       8.9518       43.9670
    23         12.0727      39.1046       9.5772       45.3353
    24         12.7896      40.3872      10.2072       46.7064
    25         13.5150      41.6575      10.8475       48.0600
    26         14.2428      42.9286      11.4946       49.4104
    27         14.9769      44.1909      12.1446       50.7600
    28         15.7136      45.4552      12.8044       52.0968
    29         16.4575      46.7103      13.4676       53.4354
    30         17.2050      47.9610      14.1360       54.7680


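Consider an example where $\chi^2$ = 43.773 and $n$ = 30. The normal
deviates corresponding to $\lambda_1$, $\lambda_2$, and $\lambda_3$ are

$$\lambda_1 = \frac{43.773 - 30}{\sqrt{60}} = 1.7781 \qquad \text{Probability} = 0.075$$

$$\lambda_2 = \sqrt{87.546} - \sqrt{59} = 1.6755 \qquad \text{Probability} = 0.095$$

$$\lambda_3 = 1.6452 \qquad \text{Probability} = 0.100$$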
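The three deviates of this example can be recomputed from the definitions in (i). The sketch below is illustrative only; the function name is an assumption, and the two-sided probabilities are obtained from the normal distribution.

```python
import math

def chi2_normal_deviates(chi2, n):
    """Three normal approximations to chi-square with n degrees of freedom."""
    lam1 = (chi2 - n) / math.sqrt(2.0 * n)
    lam2 = math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n - 1.0)
    lam3 = ((chi2 / n) ** (1.0 / 3.0) + 2.0 / (9.0 * n) - 1.0) / math.sqrt(2.0 / (9.0 * n))
    return lam1, lam2, lam3

def two_sided_probability(z):
    """Probability of a normal deviate exceeding |z| in either direction."""
    return math.erfc(abs(z) / math.sqrt(2.0))

lam1, lam2, lam3 = chi2_normal_deviates(43.773, 30)
for lam in (lam1, lam2, lam3):
    print(round(lam, 4), round(two_sided_probability(lam), 3))
```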




The most accurate of the three approximations is the last one; in fact
it gives almost an exact value to the probability of $\chi^2$'s exceeding 43.773,
which is 0.05 as seen from $\chi^2$ tables. The probability 0.100 corresponding
to $\lambda_3$ is very high so that the value of $\chi^2$ = 43.773 does not give
sufficient indication as to whether the observed variance differs from the
assigned value.

(ii) When the sample size is small (less than 30): If the observed
$\chi^2$ is above the lower 5% value or below the upper 5% value, no further
analysis is needed; the hypothesis cannot be rejected. The doubt arises
only when $\chi^2$ is beyond these limits, in which case the limits to nonsignificant
values of $\chi^2$ have to be determined.

Table 6a.1α gives the admissible range of $\chi^2$ as determined by the
locally most powerful unbiased test (see 8a.4) of Neyman and Pearson
(1939). The value of $\chi^2$ = 77.2485 in the case of the Jebel Moya
population lies outside the 5% admissible range (15.7136, 45.4552) for
28 degrees of freedom so that the null hypothesis is rejected.


6a.2 Test for a Specified Inequality of Two Estimated Variances 


The statistical analysis of data often leads to a number of estimated
variances of which it is desirable to test the homogeneity. This problem
is considerably simpler when there are only two estimated variances and
it is desired to test whether a particular estimate significantly exceeds
the other. Thus one might have the estimates of the head length
variances $s_1^2 = 42.302$ and $s_2^2 = 34.658$ based on $n_1 = 24$ and $n_2 = 30$
degrees of freedom for males and females. The important question
to be asked, in this connection, is whether the male head length is, as
commonly believed, more variable than the female head length. The
statistic to be constructed for this purpose is

$$F = \frac{s_1^2}{s_2^2} = 1.2205$$

which can be entered in the variance ratio table with $n_1 = 24$ and
$n_2 = 30$ degrees of freedom. The 5% value of $F$ is 1.89, which is
greater than the observed $F$, so that there is no evidence against the
equality of variances in the two sexes.

It may be noted that this does not prove that the male and female 
head lengths are equally variable. The evidence supplied by the above 
data may not be sufficient to detect the difference, if any. The ratio 
such as the observed could be expected not infrequently in samples of 
the above sizes when, in fact, the variances are equal. To detect the 
difference as indicated by the above ratio a very large number of degrees 
of freedom, i.e., a large sample, would be necessary. 




If sample sizes are large, the test for equality of standard deviations
can be carried out in a simple way. If any estimated variance $s^2$ based
on $n$ degrees of freedom is transformed by the relation (see 5e.4)

$$y = \log_e s$$

then $y$ tends to be normally distributed with

$$\text{Mean} = \log_e \sigma \quad \text{and} \quad \text{Variance} = \frac{1}{2n}$$

so that the variance is independent of $\sigma$. Kemsley (1950) found the
following standard deviations for heights of males and females.

                 Sample                               Variance
               Number (n)       s       log_e s        1/(2n)
♂                27515         2.85     1.0473       0.00001817
♀                33562         2.58     0.9478       0.00001490
Difference                              0.0995       0.00003307

The standard error of the difference is $\sqrt{0.00003307} = 0.005751$, and the
ratio of the difference to standard error is $w = 17.30$, which as a normal
deviate has a very small probability, thus showing that the variabilities
are different for ♂ and ♀, females being less variable than males. This
is in accordance with an observed fact in biological material that the
standard deviation depends on the mean size of organisms: the higher
the mean, the greater is the scatter. The female dimensions are smaller
than the male, and this is reflected in the standard deviation. We
can make a closer examination by considering the data for various
ages since the overall standard deviation may be affected by the age
composition of the samples for males and females.
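The deviate $w$ used here is the difference of the two values of $\log_e s$ divided by its standard error. A minimal sketch follows; the function name is an assumption, and the sample sizes are used in place of the degrees of freedom, as in the computation above.

```python
import math

def log_sd_deviate(s1, n1, s2, n2):
    """Normal deviate for the difference between two standard deviations,
    using log_e s with variance 1/(2n)."""
    difference = math.log(s1) - math.log(s2)
    standard_error = math.sqrt(1.0 / (2.0 * n1) + 1.0 / (2.0 * n2))
    return difference / standard_error

# all ages together: males s = 2.85 (n = 27515), females s = 2.58 (n = 33562)
w_overall = log_sd_deviate(2.85, 27515, 2.58, 33562)
# age 14 in Kemsley's table: males 3.66 (n = 305), females 2.68 (n = 244)
w_age_14 = log_sd_deviate(3.66, 305, 2.68, 244)
print(round(w_overall, 2), round(w_age_14, 2))
```

The overall value differs from 17.30 only in the second decimal, because the text rounds the logarithms to four places before taking the difference.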


TABLE 6a.2α. Standard Deviation of Stature * (Kemsley, 1950)

          Standard Deviation        Sample Size        Statistic
  Age        ♂         ♀            ♂         ♀            w

 14         3.66      2.68         305       244          5.13
 14.5       3.71      2.40         892       732         12.35
 15.5       3.57      2.50        1404      1378         13.29
 16.5       3.07      2.35        1644      1699         10.92
 17.5       2.95      2.50        1280      1724          6.34
 18.5       2.86      2.23         786      1535          8.02
 19.5       2.87      2.34         591      1376          5.87
 20.5       2.70      2.31         467      1284          4.08
 21.5       2.65      2.34         460      1224          3.21
 22.5       2.62      2.37         477      1250          2.63

* The ages in the original table extend up to 74.5, and only the first 10 have been
chosen to illustrate the test.




It is found that at each age the statistic $w$ exceeds the upper 5%
value of the normal deviate, showing that female variability is less at
each age. If samples were not so numerous as those above, it would
be difficult to detect the difference in variabilities at each age. In such
a situation the normal probabilities for exceeding $w$ could be calculated
at each age and the $P_\lambda$ test (5f.2) carried out to combine the evidences
supplied by all age groups.


6a.3 The Likelihood Criterion and Its Use 


With two estimated variances a situation different from the one considered
in 6a.2 may arise. The estimated variances of nasal indices
may be available for two series of male skulls. The question to be
asked is whether the variances in the cranial populations from which the
two series are samples can be considered equal. In the absence of any
knowledge as to the possible inequality relationship between the two
population variances, the following statistic

$$L = \frac{(n_1 + n_2)\,(s_1^2)^{n_1/(n_1 + n_2)}\,(s_2^2)^{n_2/(n_1 + n_2)}}{n_1 s_1^2 + n_2 s_2^2}$$

which is the ratio of the weighted geometric mean to the arithmetic
mean of the estimates, may be constructed. This lies between 0 and 1,
the value 0 being reached when the ratio of one estimate to the other
is large, and 1 when the two are equal. A small value of $L$ would thus
indicate a difference in the population variances.

Instead of $L$ it is convenient, from the point of view of computations,
to consider the statistic (due to Bartlett, 1934)

$$M = -n \log_e L = n \log_e\left\{\frac{n_1 s_1^2 + n_2 s_2^2}{n}\right\} - n_1 \log_e s_1^2 - n_2 \log_e s_2^2$$

where $n = n_1 + n_2$. $M$ varies from 0 to $\infty$, with small values of $L$
corresponding to high values of $M$.

This can be extended to the case where $k$ estimated variances have to
be tested for homogeneity. If $s_1^2, \cdots, s_k^2$ are the estimated variances
with $n_1, \cdots, n_k$ degrees of freedom, then the statistic to be constructed is

$$M = n \log_e\left\{\frac{n_1 s_1^2 + \cdots + n_k s_k^2}{n}\right\} - n_1 \log_e s_1^2 - \cdots - n_k \log_e s_k^2$$

where $n = n_1 + \cdots + n_k$.


The probability of $M$'s exceeding the observed value $M_0$ can be expanded
in an asymptotic series, the first six terms of which are given
here.

$$\sum_{i=0}^{6} \beta_i P_{k-1+2i}(M_0)$$

where $P_{k-1+2i}(M_0)$ stands for the probability * of the $\chi^2$ with $(k - 1 + 2i)$
degrees of freedom exceeding $M_0$. The values of $\beta$ are


Bo = 1 
Bi = $1 a=2o—5 
Bo = 4b? 

1 1 
Bs = —gscs + 3b? ce ae 
Bs = BBs — 48261" 

1 1 
“Bs = gists — 756p — 3PiPa = BT w 


a 
Bo = rkabics — yobs — 6F18s 


6a.4 Practical Applications

The exact formula derived above for evaluating the probability of
exceeding the observed value $M_0$ need not always be used in practice.
The approximations which can be profitably used in many situations
are given with suitable illustrations.

First Approximation. If the observed value $M_0$ is less than the 5%
(or 1%) value of $\chi^2$ with $(k - 1)$ degrees of freedom, then the hypothesis
of equality of variances cannot be rejected on the 5% (or 1%) level.
This is due to the fact that the exact 5% and 1% limits of $M$ are beyond
the corresponding values of the $\chi^2$ limits.

Table 6a.4α gives a set of 10 estimates of variance, calculated on
10 samples of weight records of schoolboys of similar age but from
different forms, and it is desired to test whether there are any real "form
differences" in the weight dispersion of the boys.


* These values can be obtained from Tables of the Incomplete Γ-Function, edited
by K. Pearson.




TABLE 6a.4α. The Estimated Variances and Evaluation of the M-Statistic
(Hartley and Pearson, 1946)

 Form     No. of     Weight Variance    D.F. =
  No.      Boys           sᵢ²             nᵢ       log_e sᵢ²    nᵢ log_e sᵢ²     nᵢsᵢ²
   1        10             51              9         3.93           35.4          459
   2        15             78             14         4.36           61.0         1092
   3        21             91             20         4.51           90.2         1820
   4        23             52             22         3.95           86.9         1144
   5        15            101             14         4.62           64.7         1414
   6        11             36             10         3.58           35.8          360
   7        31             41             30         3.71          111.3         1230
   8        15             76             14         4.33           60.6         1064
   9         3             64              2         4.16            8.3          128
  10         6             93              5         4.53           22.6          465
 Total                                  140 = n                    576.8         9176

$$M = n \log_e\left\{\frac{\Sigma(n_i s_i^2)}{n}\right\} - \Sigma n_i \log_e s_i^2$$

$$= 140 \log_e\left(\frac{9176}{140}\right) - 576.8$$

$$= 140 \times 4.183 - 576.8 = 8.8$$

The 5% value of $\chi^2$ with (10 - 1) = 9 degrees of freedom is 15.51 so
that the observed value 8.8, being less than this, cannot be considered
significant. No further calculations are needed in this case. The best
estimate of the common variance is $(\Sigma n_i s_i^2)/n = 9176/140 = 65.54$.
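The evaluation of $M$ for the ten forms can be sketched as follows. The function name is an assumption; the variances and degrees of freedom are those of the table above. Since full precision logarithms are used, the result is slightly smaller than the 8.8 obtained from logarithms rounded to two decimals.

```python
import math

def bartlett_m(variances, dfs):
    """M = n log(pooled variance) - sum of n_i log(s_i^2), natural logarithms."""
    n = sum(dfs)
    pooled = sum(f * v for v, f in zip(variances, dfs)) / n
    return n * math.log(pooled) - sum(f * math.log(v) for v, f in zip(variances, dfs))

weight_variances = [51, 78, 91, 52, 101, 36, 41, 76, 64, 93]
degrees_of_freedom = [9, 14, 20, 22, 14, 10, 30, 14, 2, 5]

pooled_variance = (sum(f * v for v, f in zip(weight_variances, degrees_of_freedom))
                   / sum(degrees_of_freedom))
m_statistic = bartlett_m(weight_variances, degrees_of_freedom)
print(round(pooled_variance, 2), round(m_statistic, 1))
```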
Second Approximation. Bartlett (1934) suggested the use of

$$M' = \frac{M}{1 + c_1/3(k - 1)}$$

as $\chi^2$ with $(k - 1)$ degrees of freedom, where $c_1 = \Sigma(1/n_i) - (1/n)$.
This approximation tends to increase the probability for an observed
$M_0$ so that if $M'$ is significant on any desired level then $M$ is certainly so.

The following data give a number of variances estimated on different
degrees of freedom. These have been calculated from the yields of rice
observed in successive blocks (columns) when a rectangular lattice is
superimposed on a big field. It is desired to test whether the block
variance is independent of its size, i.e., the number of cells it contains.


TABLE 6a.4β. Variances for Different Block Sizes

            Mean Variance
 No. of     for Blocks of      D.F.
 Cells      the Same Size       nᵢ      log_e sᵢ²    nᵢ log_e sᵢ²       nᵢsᵢ²         1/nᵢ
                 sᵢ²
   14           48.76            26      3.8870        101.0620       1,267.76     0.038461
   23          101.97            44      4.6250        203.5000       4,486.68     0.022727
   35          122.67           170      4.8096        817.6320      20,853.90     0.005882
   45           94.39           264      4.5474       1200.5136      24,918.96     0.003788
   51           78.40           200      4.3618        872.3600      15,680.00     0.005000
 Total                          704                   3195.0676      67,207.30     0.075858

$$M = 704 \log_e\left(\frac{67{,}207.30}{704}\right) - 3195.0676$$

$$= 704(4.5589) - 3195.0676 = 14.3276$$

This value is greater than the 1% value of $\chi^2$ with 4 degrees of freedom.
In this case the alternative statistic is

$$M' = \frac{M}{1 + c_1/3(k - 1)}$$

where $c_1 = 0.075858 - 0.001420 = 0.074438$.

$$M' = \frac{14.3276}{1 + 0.074438/12} = \frac{14.3276}{1.006203} = 14.2393$$

This also exceeds the 1% value of $\chi^2$ with 4 degrees of freedom so that
the variances cannot be considered equal.

This shows that the variance is a function of the block size. Contrary
to expectation, this is not an increasing function of the block size. The
decrease in variance after a certain size of the block may be due to some
periodicity in the fertility gradient.
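The second approximation adds only one step to the computation of $M$. The sketch below, with illustrative function names, evaluates $M$, $c_1$ and $M'$ for the five block sizes; small differences from the figures in the text come from the rounding of the logarithms there.

```python
import math

def bartlett_m(variances, dfs):
    """M = n log(pooled variance) - sum of n_i log(s_i^2), natural logarithms."""
    n = sum(dfs)
    pooled = sum(f * v for v, f in zip(variances, dfs)) / n
    return n * math.log(pooled) - sum(f * math.log(v) for v, f in zip(variances, dfs))

def bartlett_corrected(variances, dfs):
    """Return (M, c1, M') with M' = M / (1 + c1 / (3 (k - 1)))."""
    k = len(dfs)
    c1 = sum(1.0 / f for f in dfs) - 1.0 / sum(dfs)
    m = bartlett_m(variances, dfs)
    return m, c1, m / (1.0 + c1 / (3.0 * (k - 1)))

block_variances = [48.76, 101.97, 122.67, 94.39, 78.40]
block_dfs = [26, 44, 170, 264, 200]

m, c1, m_prime = bartlett_corrected(block_variances, block_dfs)
print(round(m, 2), round(c1, 6), round(m_prime, 2))
```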
6a.5 Problems Requiring an Exact Treatment

It is seen in 6a.4 that in any practical situation a decision can be
reached if $M$ < 5% value of $\chi^2$ with $(k - 1)$ degrees of freedom (not
significant) and $M' = M/[1 + c_1/3(k - 1)]$ > 5% value of $\chi^2$ with
$(k - 1)$ degrees of freedom (significant). There may arise situations where
$M'$ < 5% value of $\chi^2$ < $M$. The formula given in 6a.3 then has to be
used. In practice the situation presented above will occur only when
the $M$ statistic is just near its significance limit. The decision as to the
acceptance or rejection of the hypothesis depends on the further use of
these estimated variances in statistical analysis. This being so, the
evaluation of the probability to a high degree of accuracy rarely will be
necessary in practical problems.

It may be noted that all the tests based on the $F$ and $L$ statistics
given in 6a.2 and 6a.3 can be extended to test for an assigned ratio
$\rho_1 : \rho_2 : \cdots : \rho_k$ of the hypothetical variances. The only modification
needed is to replace $s_i^2$ by $s_i^2/\rho_i$.


6b Homogeneity of Correlations 
6b.1 Exact Test for Zero Correlation 


It has been shown (example 2, in 2d.2) that, when the correlation in a bivariate population is zero, the statistic

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

where r is the correlation coefficient calculated on a sample of size n, is distributed as

$$\frac{1}{\sqrt{n-2}\;B\!\left(\frac{n-2}{2}, \frac{1}{2}\right)}\left(1 + \frac{t^2}{n-2}\right)^{-(n-1)/2} dt$$

This is the same as the t distribution with $(n - 2)$ degrees of freedom. To test whether the population correlation coefficient is zero, the statistic t defined above is calculated and tested for significance by the use of the t table.

The correlation between frontal breadth and head breadth calculated from 18 crania of a series is 0.6521. Are the two dimensions, frontal breadth and head breadth, uncorrelated? The value of t is

$$t = \frac{0.6521}{\sqrt{1 - (0.6521)^2}}\sqrt{16} = 4(0.8602) = 3.4408$$

with 18 - 2 = 16 degrees of freedom. This exceeds 2.120, the 5% value of t, so that the observed correlation can be interpreted as establishing an association between frontal breadth and head breadth.
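
The arithmetic of this example can be sketched in a few lines of Python. The function name `t_for_zero_correlation` is an illustrative choice and not part of the text; the critical value 2.120 is the tabulated figure quoted above, not one computed here.

```python
import math

def t_for_zero_correlation(r, n):
    """Return (t, degrees of freedom) for testing a zero population correlation."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    return t, df

# Frontal breadth and head breadth, 18 crania, r = 0.6521
t, df = t_for_zero_correlation(0.6521, 18)
print(f"t = {t:.4f} on {df} degrees of freedom")

# Compare with the 5% value of t for 16 degrees of freedom quoted in the text.
print("significant at the 5% level:", abs(t) > 2.120)
```

Small differences in the last decimal place from the printed 3.4408 are to be expected, since the text rounds the intermediate ratio to 0.8602.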




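6b.2 Fisher’s tanh⁻¹ Transformation

It was shown in 5e.4 that the $\tanh^{-1}$ transformation of the correlation coefficient r gets rid of the unknown parameter $\rho$ in the expression for variance. Accordingly we consider the transformed values $\zeta$ and z instead of $\rho$ and r.

$$\zeta = F(\rho) = \frac{1}{2}\log_e\frac{1+\rho}{1-\rho} = \tanh^{-1}\rho$$

$$z = F(r) = \frac{1}{2}\log_e\frac{1+r}{1-r} = \tanh^{-1} r$$

Putting $z - \zeta = x$, the distribution of x may be derived from the distribution of r. The first four moments of x were found by R. A. Fisher and later revised by A. K. Gayen.

$$\mu_1' = \frac{\rho}{2(n-1)}\left[1 + \frac{5+\rho^2}{4(n-1)} + \cdots\right]$$

$$\mu_2 = \frac{1}{n-1}\left[1 + \frac{4-\rho^2}{2(n-1)} + \frac{22 - 6\rho^2 - 3\rho^4}{6(n-1)^2} + \cdots\right]$$

$$\mu_3 = \frac{\rho^3}{(n-1)^3} + \cdots$$

$$\mu_4 = \frac{1}{(n-1)^2}\left[3 + \frac{14 - 3\rho^2}{n-1} + \frac{184 - 48\rho^2 - 21\rho^4}{4(n-1)^2} + \cdots\right]$$

$$\beta_1 = \frac{\rho^6}{(n-1)^3} + \cdots$$

$$\beta_2 = 3 + \frac{2}{n-1} + \frac{4 + 2\rho^2 - 3\rho^4}{(n-1)^2} + \cdots$$

Since $\beta_1$ and $(\beta_2 - 3)$ are small even for moderate n, it follows that $(z - \zeta)$ can be considered to be approximately a normal variate with

$$\text{Mean} = \frac{\rho}{2(n-1)}$$

$$\text{Variance} = \frac{1}{n-1} + \frac{4-\rho^2}{2(n-1)^2} \simeq \frac{1}{n-3}$$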




6b.3 Test for a Given p 


In a sample of 28 the correlation coefficient is found to be 0.6521. Can such a value have arisen from a population in which the coefficient has the value 0.7211?

$$z = \frac{1}{2}\log_e\frac{1+r}{1-r} = 0.7790^{*}$$

$$\text{Mean } z = \frac{1}{2}\log_e\frac{1+\rho}{1-\rho} + \frac{\rho}{2(n-1)} = 0.9100 + \frac{0.7211}{54} = 0.9233$$

The normal deviate is

$$\sqrt{n-3}\,(z - \text{mean } z) = \sqrt{28-3}\,(0.7790 - 0.9233) = 5(-0.1443) = -0.7215$$

The chance of exceeding the value 0.7215 in either direction is about 47% so that the hypothesis cannot be rejected.

The correction term $\rho/2(n - 1)$ for the mean z is unimportant if n is large. The probability will be more precisely obtained in any case by its inclusion.
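
A minimal sketch of this test in Python follows. The function name `fisher_z_test` is hypothetical, and the two-sided tail area is taken from the standard normal distribution through `math.erfc`, which is an implementation choice and not something prescribed in the text.

```python
import math

def fisher_z_test(r, n, rho0):
    """Normal deviate for testing an assigned correlation rho0, with the bias term."""
    z = math.atanh(r)                                   # z = tanh^-1 r
    mean_z = math.atanh(rho0) + rho0 / (2.0 * (n - 1))  # includes rho / 2(n - 1)
    deviate = math.sqrt(n - 3) * (z - mean_z)
    # Two-sided tail area of the standard normal distribution.
    p_two_sided = math.erfc(abs(deviate) / math.sqrt(2.0))
    return z, mean_z, deviate, p_two_sided

z, mean_z, deviate, p = fisher_z_test(0.6521, 28, 0.7211)
print(f"z = {z:.4f}, mean z = {mean_z:.4f}, deviate = {deviate:.4f}, p = {p:.3f}")
```

Dropping the term `rho0 / (2.0 * (n - 1))` shows how little the bias correction matters for a sample of this size.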


6b.4 Test for the Equality of Two Correlation Coefficients 


Two samples consisting of $n_1$ and $n_2$ observations give the correlation coefficients $r_1$ and $r_2$. Are these values compatible with the hypothesis that the samples arose from two populations having the same correlation coefficient? Let

$$z_1 = \frac{1}{2}\log_e\frac{1+r_1}{1-r_1}$$

and

$$z_2 = \frac{1}{2}\log_e\frac{1+r_2}{1-r_2}$$

The statistic $z_1 - z_2$ is distributed about the mean

$$\frac{\rho}{2(n_1-1)} - \frac{\rho}{2(n_2-1)} \tag{6b.4.1}$$

where $\rho$ is the common correlation coefficient, with variance

$$\frac{1}{n_1-3} + \frac{1}{n_2-3}$$

If the samples are not small or if $n_1$ and $n_2$ are not very different, the statistic

$$\frac{z_1 - z_2}{\sqrt{1/(n_1-3) + 1/(n_2-3)}}$$

can be used as a normal deviate. The more exact method given in 6b.6 may be necessary when the value of (6b.4.1) is not small.

* These values can be directly obtained from the Fisher-Yates tables (transformation of r to z).


6b.5 Test for the Homogeneity of a Set of Correlation Coefficients 
Let $r_1, \ldots, r_k$ be k correlation coefficients based on samples of sizes $n_1, \ldots, n_k$. By means of the $\tanh^{-1}$ transformation, the quantities $z_1, \ldots, z_k$ corresponding to $r_1, \ldots, r_k$ can be obtained. If the bias in mean z can be neglected, the test for homogeneity of the correlation coefficients is equivalent to the test of equality of the mean values of z. The scheme of computation is as follows.

TABLE 6b.5α. Test of Homogeneity of a Set of Correlation Coefficients

Sample   Sample   Correlation    tanh⁻¹ r   Reciprocal
No.      Size     Coefficient               of Variance   (n - 3)z        (n - 3)z²
         n        r              z          n - 3
1        n_1      r_1            z_1        n_1 - 3       (n_1 - 3)z_1    (n_1 - 3)z_1²
...      ...      ...            ...        ...           ...             ...
k        n_k      r_k            z_k        n_k - 3       (n_k - 3)z_k    (n_k - 3)z_k²
Total                                       N             T_1             T_2

The best estimate of $\tanh^{-1}\rho$ when the various coefficients are homogeneous is $T_1/N$. The statistic for testing homogeneity is

$$\chi^2 = T_2 - \frac{T_1^2}{N}$$

which can be used as $\chi^2$ with $(k - 1)$ degrees of freedom.




As an example, let the correlations obtained from 6 samples of sizes 
10, 14, 16, 20, 25, 28 be 0.318, 0.106, 0.253, 0.340, 0.116, 0.112. Can 
these be considered homogeneous? 


Correlation    Sample Size
Coefficient    Minus 3
r              n - 3       z         (n - 3)z    (n - 3)z²
0.318           7          0.3294     2.3058     0.7595
0.106          11          0.1064     1.1704     0.1245
0.253          13          0.2586     3.3618     0.8694
0.340          17          0.3541     6.0197     2.1316
0.116          22          0.1164     2.5608     0.2981
0.112          25          0.1125     2.8125     0.3164
Total          95                    18.2310     4.4995

$$\frac{T_1}{N} = \frac{18.2310}{95} = 0.191905$$

$$T_2 - \frac{T_1^2}{N} = 4.4995 - 3.4986 = 1.0009$$

The value 1.0009 as $\chi^2$ with 5 degrees of freedom is not significant, so the correlations may be considered homogeneous.
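
The computational scheme of this section can be sketched as follows. The function name `homogeneity_chi2` is an illustrative assumption, and the sketch works with unrounded values of z, so its totals differ slightly from a hand computation with four-figure tables. The last entry of the $(n - 3)z^2$ column evaluates to $25 \times (0.1125)^2 \approx 0.3164$, so the statistic computed here is close to 1.00.

```python
import math

def homogeneity_chi2(rs, ns):
    """Chi-square test of homogeneity of correlations, neglecting the bias in mean z.

    Returns (chi2, degrees of freedom, pooled z = T1 / N).
    """
    weights = [n - 3 for n in ns]          # reciprocals of the variances of z
    zs = [math.atanh(r) for r in rs]       # z = tanh^-1 r
    N = sum(weights)
    T1 = sum(w * z for w, z in zip(weights, zs))
    T2 = sum(w * z * z for w, z in zip(weights, zs))
    chi2 = T2 - T1 * T1 / N
    return chi2, len(rs) - 1, T1 / N

rs = [0.318, 0.106, 0.253, 0.340, 0.116, 0.112]
ns = [10, 14, 16, 20, 25, 28]
chi2, df, pooled_z = homogeneity_chi2(rs, ns)
print(f"chi2 = {chi2:.4f} on {df} d.f., pooled z = {pooled_z:.6f}")
print(f"estimate of rho = {math.tanh(pooled_z):.4f}")
```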


6b.6 Correction for Bias in the Test for Homogeneity and the Best Estimate of ρ

When the sample sizes are not large and not nearly equal, there is a certain amount of bias (extraneous to the hypothesis tested) introduced in the $\chi^2$ statistic used in 6b.5. This is due to neglecting the term $\rho/2(n - 1)$ in the mean value of z. Even if the bias introduced in the $\chi^2$ statistic is small, the bias introduced in the best estimate of $\rho$ when $\chi^2$ is not significant will not be small when compared to the standard error of the estimate. This can be corrected by following a slightly different procedure.

Since z can be considered as a normal deviate with mean $\frac{1}{2}\log_e (1 + \rho)/(1 - \rho) + \rho/2(n - 1)$ and the variance $1/(n - 3)$, the score for $\rho$ obtained from k samples is

$$S = \sum (n_i - 3)\left[\frac{1}{1-\rho^2} + \frac{1}{2(n_i-1)}\right]\left[z_i - \frac{1}{2}\log_e\frac{1+\rho}{1-\rho} - \frac{\rho}{2(n_i-1)}\right] \tag{6b.6.1}$$

and the information

$$I = \sum (n_i - 3)\left[\frac{1}{1-\rho^2} + \frac{1}{2(n_i-1)}\right]^2 \tag{6b.6.2}$$

If the value of $\rho$ obtained in the last section is taken as a first approximation, then the additive correction $\delta\rho$ to this value is given by

$$\delta\rho = \frac{S_0}{I_0}$$

where $S_0$ and $I_0$ are the values of (6b.6.1) and (6b.6.2) calculated at the approximate value chosen. This process may be repeated till the correction becomes negligible. Having obtained the best estimate $\hat\rho$ of $\rho$, the $\chi^2$ statistic with $(k - 1)$ degrees of freedom for testing homogeneity is

$$\chi^2 = \sum (n_i - 3)\left[z_i - \frac{1}{2}\log_e\frac{1+\hat\rho}{1-\hat\rho} - \frac{\hat\rho}{2(n_i-1)}\right]^2$$
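
The iterative correction described in this section can be sketched as below, under the stated model that each z is normal with mean $\tanh^{-1}\rho + \rho/2(n - 1)$ and variance $1/(n - 3)$. The function name `corrected_rho`, the tolerance and the iteration limit are assumptions made for illustration; the data are those of the example in 6b.5.

```python
import math

def corrected_rho(rs, ns, tol=1e-10, max_iter=50):
    """Scoring iteration for rho with the bias term rho / 2(n - 1) retained.

    Returns (estimate of rho, bias-corrected chi2 on k - 1 degrees of freedom).
    """
    zs = [math.atanh(r) for r in rs]
    ws = [n - 3 for n in ns]
    # First approximation: the estimate of 6b.5, tanh(T1 / N).
    rho = math.tanh(sum(w * z for w, z in zip(ws, zs)) / sum(ws))
    for _ in range(max_iter):
        score = 0.0
        info = 0.0
        for z, n, w in zip(zs, ns, ws):
            slope = 1.0 / (1.0 - rho * rho) + 1.0 / (2.0 * (n - 1))
            resid = z - math.atanh(rho) - rho / (2.0 * (n - 1))
            score += w * slope * resid     # contribution to the score
            info += w * slope * slope      # contribution to the information
        delta = score / info               # additive correction to rho
        rho += delta
        if abs(delta) < tol:
            break
    chi2 = sum(w * (z - math.atanh(rho) - rho / (2.0 * (n - 1))) ** 2
               for z, n, w in zip(zs, ns, ws))
    return rho, chi2

rs = [0.318, 0.106, 0.253, 0.340, 0.116, 0.112]
ns = [10, 14, 16, 20, 25, 28]
rho_hat, chi2 = corrected_rho(rs, ns)
print(f"corrected estimate of rho = {rho_hat:.4f}, chi2 = {chi2:.4f}")
```

For these data the corrected estimate falls slightly below the uncorrected one, as expected, because every z carries a positive bias when $\rho$ is positive.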
References 


Bartlett, M. S. (1934). The problem in statistics of testing several variances. Proc. Camb. Phil. Soc., 30, 164.
Hartley, H. O. and E. S. Pearson (1946). Tables for testing the homogeneity of a set of estimated variances. Prefatory note. Biometrika, 33, 296.
Kemsley, W. F. F. (1950). Weight and height of a population in 1943. Ann. Eugen.
Neyman, J. and E. S. Pearson (1936). Contributions to the theory of testing statistical hypotheses I. Unbiased critical regions of type A and type A1. Stat. Res. Mem., 1, 1.


CHAPTER 7


Tests of Significance 


in Multivariate Analysis 


7a Review of Work on Multivariate Analysis 


Attempts have been made in recent years to generalize the univariate 
analysis of variance technique to the case of multiple variates. The 
extension of the theory has been slow, and only a few methods have been 
made available for practical use. The starting point of these researches, 
given by Wishart in 1928, is the simultaneous sampling distribution of 
the variances and covariances in samples from a multivariate normal 
population. A few years later Hotelling (1931) found the distribution 
of a quantity T which is a natural extension of Student’s distribution to 
a sample from a multivariate normal population. 

Wilks (1932), following the likelihood ratio method (Neyman and 
Pearson, 1928, 1931; Pearson and Neyman, 1930), obtained suitable 
generalizations in the analysis of variance applicable to several variables. 
The statistic $\Lambda$ proposed by him has been found useful in a variety of
problems. Bartlett (1934) applied it for testing the significance of treat- 
ments with respect to two variables in a varietal trial and indicated its 
general use in multivariate tests of significance. Wilks (1935) and 
Hotelling (1935) found it useful in testing the independence of several 
groups of variates. Wilks’s statistic supplied some of the basic tests 
in multivariate analysis, but the problem of tabulation has not been 
tackled except in some limited cases (Wald and Brookner, 1941). A 
very useful approximation has been suggested by Bartlett (1938), who 
further demonstrated its use in another paper (1947). 

A new line of research was initiated by Fisher (1936) with his intro- 
duction of the discriminant function analysis. It has been shown that 
a set of multiple measurements may be used to provide a discriminant 
function linear in the observations having the property that, better 
than any other linear function, it will discriminate between any two 
chosen classes such as taxonomic species, the two sexes, and so on. 



The introduction of the discriminant function led to a new method 
of deriving test criteria suitable for multiple variates. The problem is 
reduced to the case of a single variate by using a linear compound of the 
several variables, where the compounding coefficients are chosen to maxi- 
mize the value of a statistic suitable for a single variate. The applica- 
tion of this method to test the differences in mean values for several 
groups gave rise to the theory of canonical roots of determinantal equa- 
tions (Roy, 1939; Fisher, 1939; Hsu, 1939). The distribution of the 
individual roots and the exact nature of tests require further study. 
Wilks’s statistic, which is a symmetric function of the canonical roots, 
may be considered as providing an overall test of the hypothesis con- 
cerned. 

In this chapter a unified approach to the problem of tests of signifi- 
cance in multivariate analysis is developed. The concept of analysis 
of dispersion, which is a natural extension of the univariate analysis 
of variance, has been found useful in discussing multivariate problems. 

In presenting the various tests of significance it has been found con- 
venient to consider the problems arising out of a single sample and 
two samples in the first stage. They depend on simple tests of signifi- 
cance requiring the use of variance ratio tables alone and are of very 
great importance in practice. The use of Wilks’s statistic in multi- 
variate analysis involving more than two samples is considered in the 
second stage. Two powerful approximations have been found for the
exact distribution of the $\Lambda$ statistic. A number of examples have been
worked out to explain the computational procedure. 


7b Tests with Discriminant Functions 


7b.1 Two Fundamental Distributions 

The method of discriminant functions in deriving test criteria has 
been found extremely useful in multivariate analysis. The problem is 
reduced to that of a single variable by choosing a linear compound of 
the original variables and constructing a statistic suitable for the uni- 
variate case. The maximized value of this statistic obtained by a suit- 
able choice of the compounding coefficients is taken as the appropriate 
test criterion. The distribution of the statistics thus derived in problems involving a single sample and two samples depends on the two fundamental distributions considered below.

Let $(w_{ij})$, $(i, j = 1, 2, \ldots, p)$, be the matrix giving the estimates, on n degrees of freedom, of the elements in the dispersion matrix $(\alpha_{ij})$ of p normally correlated variables. The definition of $w_{ij}$ implies that it has been calculated from a certain sum of products by dividing by the appropriate degrees of freedom. Let $d_1, d_2, \ldots, d_p$ be p normal variates with the same dispersion matrix $(\alpha_{ij})$ but distributed independently of $w_{ij}$. Considering only the first r variables, $d_1, \ldots, d_r$, the statistic $T_r$ is defined by

$$nT_r = \sum_{i=1}^{r}\sum_{j=1}^{r} w_r^{ij} d_i d_j \tag{7b.1.1}$$

where $(w_r^{ij})$ is the matrix reciprocal to $(w_{ij})$, $(i, j = 1, 2, \ldots, r)$. It was shown in 2d.2 that, when $E(d_1) = \cdots = E(d_r) = 0$, the statistic

$$\frac{|nw_{ij}|}{|nw_{ij} + d_i d_j|} = \frac{1}{1 + T_r}$$

is distributed in the beta form

$$B\!\left(\frac{n-r+1}{2}, \frac{r}{2}\right)$$

in which case $T_r$ has the distribution

$$\text{const. } T_r^{(r-2)/2}(1 + T_r)^{-(n+1)/2}\, dT_r$$

This shows that

$$\frac{n-r+1}{r}\,T_r$$

can be referred to a variance ratio table with r and $(n - r + 1)$ degrees of freedom.

It was further shown that, if $d_{r+1}, \ldots, d_p$ are distributed independently of $d_1, \ldots, d_r$ and $E(d_{r+1}) = \cdots = E(d_p) = 0$, $E(d_i)$ being not necessarily zero when $i = 1, \ldots, r$, the statistic

$$\frac{|nw_{ij}|_p}{|nw_{ij} + d_i d_j|_p}\cdot\frac{|nw_{ij} + d_i d_j|_r}{|nw_{ij}|_r} = \frac{1 + T_r}{1 + T_p} = (U + 1)^{-1}$$

is distributed as $B[(n - p + 1)/2, (p - r)/2]$. This shows that

$$\frac{n+1-p}{p-r}\,U, \qquad U = \frac{1 + T_p}{1 + T_r} - 1 \tag{7b.1.2}$$

can be used as a variance ratio with $(p - r)$ and $(n + 1 - p)$ degrees of freedom. The statistic $T_p$ is calculated from the formula (7b.1.1) by using all the p variables.

All the tests of significance considered in this section depend on the 
use of the statistics defined in (7b.1.1) and (7b.1.2). 




7b.2 Problems of a Single Sample 

Student’s test connected with pairs of observations admits generaliza- 
tion in two directions. 

The first is to test whether the means of p correlated variables are the 
same on the basis of a sample of size N from a p-variate population. 
When the test shows differences in mean values, there arises the question 
of deciding whether an assigned contrast involving the p variates differs 
from the best contrast as determined from the data. 

If $x_{1i}, x_{2i}, \ldots, x_{pi}$ are the observations on the ith individual, then they may be replaced by a linear compound $z_i = l_1 x_{1i} + \cdots + l_p x_{pi}$ where $l_i$ satisfy the condition $l_1 + \cdots + l_p = 0$. The problem of determining the best contrast reduces to that of determining the compounding coefficients $l_1, \ldots, l_p$ such that the ratio of mean z to standard deviation of z is a maximum. An alternative method which has some practical advantage is as follows.

By arbitrary choice of constants we construct $(p - 1)$ independent linear combinations of the variables $x_1, \ldots, x_p$,

$$y_j = m_{1j}x_1 + \cdots + m_{pj}x_p$$

such that $\sum_i m_{ij} = 0$ for $j = 1, 2, \ldots, (p - 1)$. Choosing a linear compound of x with coefficients adding to zero is the same as choosing a linear compound of y without any restriction on the compounding coefficients. If the linear compound is

$$\lambda_1 y_1 + \lambda_2 y_2 + \cdots + \lambda_{p-1}y_{p-1}$$

then the quantity to be maximized is

$$\frac{(\lambda_1\bar{y}_1 + \cdots + \lambda_{p-1}\bar{y}_{p-1})^2}{\sum\sum \lambda_i\lambda_j w_{ij}}$$

where

$$w_{ij} = \frac{1}{N-1}\sum_{r=1}^{N}(y_{ir} - \bar{y}_i)(y_{jr} - \bar{y}_j)$$

Observing that only the ratios of $\lambda$ are uniquely determinable, the equations giving $\lambda$ may be written

$$\lambda_1 w_{1i} + \cdots + \lambda_{p-1}w_{p-1,i} = \bar{y}_i \qquad i = 1, 2, \ldots, (p - 1)$$

with the solution

$$\lambda_i = w^{1i}\bar{y}_1 + \cdots + w^{p-1,i}\bar{y}_{p-1} \qquad i = 1, 2, \ldots, (p - 1)$$

where the matrix $(w^{ij})$ is reciprocal to $(w_{ij})$. This supplies the best linear compound of y, which on transformation to x gives the best contrast determinable from the data.

The maximum value of the ratio is given by

$$\sum \lambda_i\bar{y}_i = \sum\sum w^{ij}\bar{y}_i\bar{y}_j$$

If $T_{p-1} = N(\sum\sum w^{ij}\bar{y}_i\bar{y}_j)/(N - 1)$, then, on the hypothesis that all x have the same mean value, the conditions required for the use of the statistic (7b.1.1) are satisfied so that

$$\frac{T_{p-1}(N - p + 1)}{p - 1}$$

can be used as a variance ratio with $(p - 1)$ and $(N - p + 1)$ degrees of freedom to test the above hypothesis.

The statistic $T_{p-1}$ is invariant for all sets of coefficients chosen to construct y from x so that in any practical problem either conveniently or conventionally chosen linear contrasts of x may be used to define y.

To test whether the best contrast as determined from the data is in agreement with an assigned contrast $\xi_1 x_1 + \cdots + \xi_p x_p$, or $\eta_1 y_1 + \cdots + \eta_{p-1}y_{p-1}$ in terms of y, we proceed as follows.

The appropriate statistic for testing the significance of the assigned contrast is

$$T_1 = \frac{N(\eta_1\bar{y}_1 + \cdots + \eta_{p-1}\bar{y}_{p-1})^2}{(N - 1)\sum\sum \eta_i\eta_j w_{ij}}$$

where $T_1(N - 1)$ is a variance ratio with 1 and $(N - 1)$ degrees of freedom. The appropriate statistic for all the $(p - 1)$ contrasts is $T_{p-1}$, considered before. The hypothesis specifies that all contrasts orthogonal to the assigned one have zero mean so that the conditions for the use of the statistic (7b.1.2) are satisfied. Hence

$$\frac{N - p + 1}{p - 2}\left[\frac{1 + T_{p-1}}{1 + T_1} - 1\right]$$

can be used as a variance ratio with $(p - 2)$ and $(N - p + 1)$ degrees of freedom to test the above hypothesis.

The above test can be generalized to answer the problem whether a set of k assigned contrasts contain the best contrast. In this case the statistic

$$\frac{N - p + 1}{p - k - 1}\left[\frac{1 + T_{p-1}}{1 + T_k} - 1\right]$$

can be used as a variance ratio with $(p - k - 1)$ and $(N - p + 1)$ degrees of freedom.




Example 1. The data of Table 7b.2α consist of weights of cork borings taken from the north (N), east (E), south (S), and west (W) directions of the trunk for 28 trees in a block of plantations. The problem is to test whether the bark deposit varies in thickness and hence in weight in the four directions. It was suggested that the bark deposit is likely to be uniform in N and S directions and also uniform but less so in E and W directions, so that (N - E - W + S) can be taken as the best contrast. This can, however, be tested from the given data as shown below.


TABLE 7b.2α. Weights of Cork Borings (in Centigrams) in the Four Directions for 28 Trees

 N    E    S    W         N    E    S    W
72   66   76   77        91   79  100   75
60   53   66   63        56   68   47   50
56   57   64   58        79   65   70   61
41   29   36   38        81   80   68   58
32   32   35   36        78   55   67   60
30   35   34   26        46   38   37   38
39   39   31   27        39   35   34   37
42   43   31   25        32   30   30   32
37   40   31   25        60   50   67   54
33   29   27   36        35   37   48   39
32   30   34   28        39   36   39   31
63   45   74   63        50   34   37   40
54   46   60   52        43   37   39   50
47   51   52   43        48   54   57   43


It has been found in similar studies that there exists a significant correlation between contrasts such as (N - E) and (S - W) so that the method of fitting constants for the four directions and the individual trees by the method of least squares is not appropriate. The three contrasts arising out of the four weights may then be treated as three correlated variables, in which case the theory developed above is applicable.

It is interesting to observe that the individual weights in Table 7b.2α are exceedingly asymmetrically distributed. This does not, however, invalidate the test so long as the contrasts are normally distributed. In fact, the distribution of the individual weights depends on the nature of plants and the variation between plants. If the above condition is satisfied, it is not necessary that the individual weights should follow any distribution law of the known type. It may be sometimes necessary to make a transformation (such as log, square, or cube root) of the variables under consideration to ensure that the contrasts of the transformed variables are symmetrically distributed if the contrasts of the original variables are not so.

As observed earlier, the contrasts may be conveniently or conventionally chosen. In the above example we may choose the simple set of contrasts

$$y_1 = N - E - W + S \qquad y_2 = S - W \qquad y_3 = N - S$$

The mean values and estimates of variances and covariances based on 27 degrees of freedom for $y_1, y_2, y_3$ are

$$\bar{y}_1 = 8.8571 \qquad \bar{y}_2 = 4.5000 \qquad \bar{y}_3 = 0.8571$$

$$(w_{ij}) = \begin{pmatrix} 128.7200 & 61.4076 & -21.0211 \\ 61.4076 & 56.9259 & -28.2963 \\ -21.0211 & -28.2963 & 63.5344 \end{pmatrix}$$


The coefficients of the best linear function $\lambda_1 y_1 + \lambda_2 y_2 + \lambda_3 y_3$ are given by the equations

$$128.7200\lambda_1 + 61.4076\lambda_2 - 21.0211\lambda_3 = 8.8571$$
$$61.4076\lambda_1 + 56.9259\lambda_2 - 28.2963\lambda_3 = 4.5000$$
$$-21.0211\lambda_1 - 28.2963\lambda_2 + 63.5344\lambda_3 = 0.8571$$

Solving, $\lambda_1 = 0.05620$, $\lambda_2 = 0.04415$, $\lambda_3 = 0.05174$, so that the best contrast is

$$\lambda_1(N - E - W + S) + \lambda_2(S - W) + \lambda_3(N - S) = 0.10794N - 0.05620E - 0.10035W + 0.04861S$$

or, by multiplying the coefficients by 10 (arbitrarily),

$$1.0794N - 0.5620E - 1.0035W + 0.4861S$$
The statistic for testing the hypothesis of equality of means is

$$T_{p-1} = \frac{N}{N-1}(\lambda_1\bar{y}_1 + \lambda_2\bar{y}_2 + \lambda_3\bar{y}_3) = \frac{28}{27}(0.740790) = 0.768226$$

$$T_{p-1}\frac{(N - p + 1)}{(p - 1)} = \frac{0.768226(28 - 4 + 1)}{3} = 6.4019$$

The quantity 6.4019 as a variance ratio with 3 and 25 degrees of freedom is significant at the 1% level so that the bark deposit cannot be considered uniform in the four directions.
The assigned contrast is represented by $y_1$. To test for its significance the statistic is

$$T_1 = \frac{N}{N-1}\,\frac{\bar{y}_1^2}{w_{11}} = \frac{28(8.8571)^2}{27(128.7200)} = 0.632020$$

The quantity $(N - 1)T_1 = 17.0645$ as the variance ratio with 1 and 27 degrees of freedom is significant.

To test whether the assigned contrast agrees with that estimated from the data, the statistic U defined in (7b.1.2) has to be calculated.

$$\frac{N - p + 1}{p - 2}\,U = \frac{N - p + 1}{p - 2}\left[\frac{1 + T_{p-1}}{1 + T_1} - 1\right] = \frac{25}{2}\left(\frac{1.768226}{1.632020} - 1\right) = 1.0431$$

This value as the variance ratio with 2 and 25 degrees of freedom is 
small so that the evidence supplied by the data is not sufficient to reject 
the assigned contrast as not the best, although the ratios of the coeffi- 
cients in the estimated contrast depart considerably from those assigned. 
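
The three tests of Example 1 can be reproduced from the printed means and dispersion matrix, as in the sketch below. It assumes NumPy is available; the variable names and the matrix `M` mapping the contrasts back to the four directions are illustrative choices, and the raw weights are not used.

```python
import numpy as np

N, p = 28, 4                                  # trees, directions
ybar = np.array([8.8571, 4.5000, 0.8571])     # means of y1, y2, y3
W = np.array([[128.7200, 61.4076, -21.0211],
              [61.4076, 56.9259, -28.2963],
              [-21.0211, -28.2963, 63.5344]])  # on N - 1 = 27 degrees of freedom

# Coefficients of the best linear compound: solve W * lam = ybar.
lam = np.linalg.solve(W, ybar)

# Equality of the four means: variance ratio with (p - 1, N - p + 1) d.f.
T_all = N / (N - 1) * (lam @ ybar)
F_equal_means = T_all * (N - p + 1) / (p - 1)

# Significance of the assigned contrast y1: variance ratio with (1, N - 1) d.f.
T_1 = N * ybar[0] ** 2 / ((N - 1) * W[0, 0])
F_assigned = (N - 1) * T_1

# Agreement of assigned and best contrast: (p - 2, N - p + 1) d.f.
U = (1 + T_all) / (1 + T_1) - 1
F_agreement = U * (N - p + 1) / (p - 2)

# Express the best compound in terms of N, E, S, W (rows of M are y1, y2, y3).
M = np.array([[1, -1, 1, -1],
              [0, 0, 1, -1],
              [1, 0, -1, 0]])
contrast = lam @ M

print("lambda:", np.round(lam, 5))
print("best contrast (N, E, S, W):", np.round(contrast, 5))
print(f"F(3, 25) = {F_equal_means:.4f}")
print(f"F(1, 27) = {F_assigned:.4f}")
print(f"F(2, 25) = {F_agreement:.4f}")
```

Because the quadratic form is invariant to the choice of contrasts, any other set of three independent contrasts would give the same value of `F_equal_means`.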

Another problem connected with a single sample is to test for the significance of the departures of the observed mean values from those assigned. Let $\bar{x}_1, \ldots, \bar{x}_p$ be the mean values based on a sample of size N, and $\xi_1, \ldots, \xi_p$ the assigned values. If $(w_{ij})$ is the covariance matrix of $x_1, \ldots, x_p$ estimated on n degrees of freedom, then

$$T_p = \frac{N}{n}\sum\sum w^{ij}(\bar{x}_i - \xi_i)(\bar{x}_j - \xi_j)$$

The variance ratio with p and $(n + 1 - p)$ degrees of freedom to test the above hypothesis is

$$T_p\,\frac{(n + 1 - p)}{p} = \frac{N(n + 1 - p)}{np}\sum\sum w^{ij}(\bar{x}_i - \xi_i)(\bar{x}_j - \xi_j)$$

In many problems both the mean values and the covariance matrix are estimated from the same sample, in which case $n = N - 1$.
Example 2. Consider the covariance matrix

$$\begin{pmatrix} 128.7200 & 61.4076 & -21.0211 \\ 61.4076 & 56.9259 & -28.2963 \\ -21.0211 & -28.2963 & 63.5344 \end{pmatrix}$$

estimated on 27 degrees of freedom and mean values 8.8571, 4.5000, 0.8571 based on 28 observations as in example 1 above. Suppose that it is required to test whether the calculated averages agree with the assigned values 5, 1, and -2. The deviations are


$$8.8571 - 5 = 3.8571 \qquad 4.5000 - 1 = 3.5000 \qquad 0.8571 + 2 = 2.8571$$

To evaluate the quadratic form

$$\sum\sum w^{ij}d_i d_j$$

we follow the form adopted in 1d.1 and sweep out the matrix.

        Dispersion Matrix              Deviations
128.7200    61.4076   -21.0211          3.8571
            56.9259   -28.2963          3.5000
                       63.5344          2.8571
                                        0

This gives the last pivotal quantity 0.6529 (with a negative sign) which is the value of the quadratic form. The variance ratio with 3 and (27 + 1 - 3) degrees of freedom is

$$\frac{28}{27}\cdot\frac{27 + 1 - 3}{3}(0.6529) = \frac{457.0300}{81} = 5.6423$$

which is significant on the 1% level, indicating departure from the expected.
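
A short sketch of Example 2 follows. It replaces the sweep-out of 1d.1 by a direct linear solve with NumPy, which is a swapped-in technique giving the same quadratic form; the variable names are illustrative.

```python
import numpy as np

N, n, p = 28, 27, 3          # sample size, d.f. of the dispersion matrix, variables
W = np.array([[128.7200, 61.4076, -21.0211],
              [61.4076, 56.9259, -28.2963],
              [-21.0211, -28.2963, 63.5344]])
means = np.array([8.8571, 4.5000, 0.8571])
assigned = np.array([5.0, 1.0, -2.0])

d = means - assigned                      # deviations 3.8571, 3.5000, 2.8571
Q = d @ np.linalg.solve(W, d)             # quadratic form sum sum w^{ij} d_i d_j
F = N / n * (n + 1 - p) / p * Q           # variance ratio with (p, n + 1 - p) d.f.

print(f"quadratic form = {Q:.4f}")
print(f"F({p}, {n + 1 - p}) = {F:.4f}")
```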

The second generalization of Student’s t is concerned with testing, on the basis of a sample of size N from a 2p-variate population containing the variables $y_1, \ldots, y_p, y_{p+1}, \ldots, y_{2p}$, whether the mean values of $y_i$ and $y_{i+p}$ are the same for all i. The 2p variates can be reduced to p variates

$$z_1 = y_{p+1} - y_1 \qquad z_2 = y_{p+2} - y_2 \qquad \cdots \qquad z_p = y_{2p} - y_p$$

in which case the problem reduces to the one considered above. The variance ratio with p and $(N - p)$ degrees of freedom is

$$\frac{N}{N-1}\cdot\frac{N-p}{p}\sum\sum w^{ij}\bar{z}_i\bar{z}_j$$

where $(w_{ij})$ is the dispersion matrix of $z_1, \ldots, z_p$ based on $(N - 1)$ degrees of freedom.

The test is useful in various situations. Suppose that we want to test for asymmetry of organisms. The sets $y_1, \ldots, y_p$ and $y_{p+1}, \ldots, y_{2p}$ will then correspond to the same measurements on the right and left sides of an organism. Another interesting study is whether the first born in a family differs from the second born. To illustrate the method, a random sample of 25 families has been chosen from Dr. G. P. Frets’s data giving the head lengths and breadths of all sons and daughters in a large number of families in Germany. For effective comparison, only adults of the same sex (say sons or daughters) have to be chosen.

TABLE 7b.2β. The Measurements on the First and Second Adult Sons in a Sample of 25 Families (Data by G. P. Frets)

       Head Length                    Head Breadth
First   Second   Differ-       First   Second   Differ-
Son     Son      ence          Son     Son      ence
191     179       12           155     145       10
195     201       -6           149     152       -3
181     185       -4           148     149       -1
183     188       -5           153     149        4
176     171        5           144     142        2
208     192       16           157     152        5
189     190       -1           150     149        1
197     189        8           159     152        7
188     197       -9           152     159       -7
192     187        5           150     151       -1
179     186       -7           158     148       10
183     174        9           147     147        0
174     185      -11           150     152       -2
190     195       -5           159     157        2
188     187        1           151     158       -7
163     161        2           137     130        7
195     183       12           155     158       -3
186     173       13           153     148        5
181     182       -1           145     146       -1
175     165       10           140     137        3
192     185        7           154     152        2
174     178       -4           143     147       -4
176     176        0           139     143       -4
197     200       -3           167     158        9
190     187        3           163     150        3

Mean difference   1.88                           1.48

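The dispersion matrix of the differences estimated on 24 degrees of freedom is

$$\begin{pmatrix} 68.03 & 11.52 \\ 11.52 & 24.01 \end{pmatrix} \quad \text{with its inverse} \quad \begin{pmatrix} 0.015999 & -0.007677 \\ -0.007677 & 0.045332 \end{pmatrix}$$

$$\sum\sum w^{ij}\bar{d}_i\bar{d}_j = 0.015999(1.88)^2 - 2(0.007677)(1.88)(1.48) + 0.045332(1.48)^2 = 0.113121$$

$$\frac{N}{N-1}\cdot\frac{N-p}{p}(0.113121) = \frac{25}{24}\cdot\frac{23}{2}(0.1131) = 1.3548$$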


This is not significant as a variance ratio with 2 and 23 degrees of 
freedom. There is no difference in the dimensions of the first son and 
the second as judged by the above sample. The method described 
above for such studies is quite general and can be applied to any number 
of characters. 
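
The paired comparison for the first and second sons can be sketched from the printed mean differences and the printed inverse of the dispersion matrix. NumPy is assumed, the names are illustrative, and the raw measurements are not used.

```python
import numpy as np

N, p = 25, 2                                   # families, characters
mean_diff = np.array([1.88, 1.48])             # head length, head breadth
W_inv = np.array([[0.015999, -0.007677],
                  [-0.007677, 0.045332]])      # inverse dispersion matrix, 24 d.f.

Q = mean_diff @ W_inv @ mean_diff              # sum sum w^{ij} dbar_i dbar_j
F = N / (N - 1) * (N - p) / p * Q              # variance ratio with (p, N - p) d.f.

print(f"quadratic form = {Q:.6f}")
print(f"F({p}, {N - p}) = {F:.4f}")
```

The value of F differs from the printed 1.3548 only in the fourth decimal place, because the text rounds the quadratic form to 0.1131 before multiplying.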


7b.3 Mahalanobis’ D² and Problems of Two Samples

Let $N_1$ and $N_2$ be the sizes of the samples drawn from two populations, each characterized by p variates. The sample means for the ith character are represented by $\bar{x}_{i1}$ and $\bar{x}_{i2}$ for the first and second samples, respectively. The estimated value of the covariance is given by

$$(N_1 + N_2 - 2)w_{ij} = \sum_{r=1}^{N_1}(x_{i1r} - \bar{x}_{i1})(x_{j1r} - \bar{x}_{j1}) + \sum_{r=1}^{N_2}(x_{i2r} - \bar{x}_{i2})(x_{j2r} - \bar{x}_{j2})$$

the right-hand expression being the sum of the corrected sums of products for two samples. Mahalanobis’ (1936) distance between the two populations as estimated from the sample on the basis of the p characters is *

$$D_p^2 = \sum_{i=1}^{p}\sum_{j=1}^{p} w^{ij}(\bar{x}_{i1} - \bar{x}_{i2})(\bar{x}_{j1} - \bar{x}_{j2})$$

where $(w^{ij})$ is the reciprocal of $(w_{ij})$, $(i, j = 1, 2, \ldots, p)$. The exact distribution of $D^2$ on the hypothesis specifying real differences in mean values is derived in 2d.2. To test the hypothesis specifying no difference in mean values of the p characters for the two populations, the statistic

$$\frac{N_1N_2(N_1 + N_2 - p - 1)}{p(N_1 + N_2)(N_1 + N_2 - 2)}\,D_p^2$$

can be used as a variance ratio with p and $(N_1 + N_2 - 1 - p)$ degrees of freedom.

* The subscript p in the symbol $D_p^2$ denotes the number of characters used. The suffix may be omitted unless various $D^2$ values based on different sets or numbers of characters are to be kept distinct in any problem.
As observed earlier, the above test can be derived in an interesting way suggested by R. A. Fisher. If the p measurements are replaced by a linear compound

$$y = l_1x_1 + \cdots + l_px_p$$

then the ratio of between to within variance of y from the two samples is

$$\frac{N_1N_2}{N_1 + N_2}\cdot\frac{(l_1d_1 + \cdots + l_pd_p)^2}{\sum\sum l_il_jw_{ij}}$$

Maximizing this, we find that the coefficients of the best linear function separating the two groups are obtained as solutions of the equations

$$l_1w_{11} + l_2w_{12} + \cdots = d_1\mu$$
$$l_1w_{21} + l_2w_{22} + \cdots = d_2\mu$$
$$\cdots$$
$$l_1w_{p1} + l_2w_{p2} + \cdots = d_p\mu$$

where $\mu$ is a constant. Observing that only ratios of l can be uniquely determined, we can replace $\mu$ by unity and solve the above equations. Multiplying the above equations by $l_1, l_2, \ldots$ and adding, we find

$$\sum\sum l_il_jw_{ij} = l_1d_1 + \cdots + l_pd_p = \sum\sum w^{ij}d_id_j = D_p^2$$

The optimum ratio is then

$$\frac{N_1N_2}{N_1 + N_2}\sum\sum w^{ij}d_id_j = \frac{N_1N_2}{N_1 + N_2}\,D_p^2$$

The significance of this can be tested as shown above.

Example. The following tables (7b.3α and 7b.3β), reproduced from Fisher (1938), give the mean values based on 50 observations each and the covariance based on (50 + 50 - 2) degrees of freedom for four characters in two species of plants Iris versicolor and Iris setosa.

The solution of the equations (see Table 7b.6β)

$$l_1w_{1i} + \cdots + l_4w_{4i} = d_i \qquad i = 1, 2, 3, 4$$

is obtained as

$$l_1 = -3.0692 \qquad l_2 = -18.0006 \qquad l_3 = 21.7641 \qquad l_4 = 30.7549$$

so that the discriminant function is

$$-3.0692x_1 - 18.0006x_2 + 21.7641x_3 + 30.7549x_4$$

The value of $D^2$ is

$$l_1d_1 + \cdots + l_pd_p = 103.2119$$

To test for the differences in mean values the statistic is

$$\frac{N_1N_2(N_1 + N_2 - 1 - 4)}{(N_1 + N_2)(N_1 + N_2 - 2)}\cdot\frac{D_4^2}{4} = \frac{50 \times 50 \times 95}{100 \times 98 \times 4}(103.2119) = 23.75(26.3295) = 625.3256$$

which as a variance ratio with 4 and 95 degrees of freedom is significant.


TABLE 7b.3α. Observed Mean Values Based on 50 Observations Each for the Two Species

                       Iris          Iris
Character              versicolor    setosa     Difference
Sepal length (x1)      5.936         5.006       0.930
Sepal width (x2)       2.770         3.428      -0.658
Petal length (x3)      4.260         1.462       2.798
Petal width (x4)       1.326         0.246       1.080


TABLE 7b.3β. The Pooled Covariance Matrix (w_ij) Based on 98 Degrees of Freedom

        x1          x2          x3          x4
x1      0.195340    0.092200    0.099626    0.033055
x2      0.092200    0.121079    0.047175    0.025251
x3      0.099626    0.047175    0.125488    0.039586
x4      0.033055    0.025251    0.039586    0.025106


The method of evaluating the $D^2$ given above is useful because the best discriminating function is also found out during the process. This is useful in problems of classification as treated in the next chapter.
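
As a hedged illustration, the discriminant coefficients and $D^2$ for the two Iris species can be obtained by solving the equations $\sum_j l_jw_{ji} = d_i$ numerically from the printed means and pooled covariance matrix. NumPy is assumed and the variable names are illustrative. A direct machine solution need not agree with hand-computed figures beyond the first few significant digits, so the comparison with the printed values should be made with some tolerance.

```python
import numpy as np

N1 = N2 = 50
p = 4
d = np.array([0.930, -0.658, 2.798, 1.080])        # versicolor minus setosa
W = np.array([[0.195340, 0.092200, 0.099626, 0.033055],
              [0.092200, 0.121079, 0.047175, 0.025251],
              [0.099626, 0.047175, 0.125488, 0.039586],
              [0.033055, 0.025251, 0.039586, 0.025106]])  # 98 degrees of freedom

l = np.linalg.solve(W, d)          # coefficients of the discriminant function
D2 = l @ d                         # Mahalanobis D^2 on p characters

# Variance ratio with (p, N1 + N2 - 1 - p) degrees of freedom.
F = N1 * N2 * (N1 + N2 - p - 1) * D2 / (p * (N1 + N2) * (N1 + N2 - 2))

print("l =", np.round(l, 4))
print(f"D^2 = {D2:.4f}")
print(f"F({p}, {N1 + N2 - 1 - p}) = {F:.2f}")
```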
7b.4 Test for an Assigned Discriminant Function 

In the last section the discriminant function for Iris versicolor and Iris setosa based on four measurements was found to be

$$-3.0692x_1 - 18.0006x_2 + 21.7641x_3 + 30.7549x_4$$

with the value of $D_4^2 = 103.2119$. Since the mean measurements for versicolor exceed those for setosa except in sepal width ($x_2$), a discriminant function of the type

$$x_1 - x_2 + x_3 + x_4$$

may be suggested. In such a case it might be of interest to know whether the discriminant function derived above is an improvement over the assigned simpler function. If the assigned function is represented by y, then

$$D_y^2 = \frac{(\bar{y}_1 - \bar{y}_2)^2}{V(y)}$$

where $\bar{y}_1$ and $\bar{y}_2$ are the mean values of y for the two species.

$$\bar{y} = \bar{x}_1 - \bar{x}_2 + \bar{x}_3 + \bar{x}_4$$

$$\bar{y}_1 - \bar{y}_2 = d_1 - d_2 + d_3 + d_4 = 0.930 + 0.658 + 2.798 + 1.080 = 5.466$$

$$V(y) = V(x_1) + V(x_2) + V(x_3) + V(x_4) - 2\,\text{cov}(x_1x_2) + 2\,\text{cov}(x_1x_3) + 2\,\text{cov}(x_1x_4) + 2\,\text{cov}(x_3x_4) - 2\,\text{cov}(x_2x_3) - 2\,\text{cov}(x_2x_4)$$

$$= w_{11} + w_{22} + w_{33} + w_{44} - 2w_{12} - 2w_{23} - 2w_{24} + 2w_{13} + 2w_{14} + 2w_{34} = 0.482295$$

using the values of $w_{ij}$ given in Table 7b.3β.

$$D_y^2 = \frac{29.8771}{0.482295} = 61.9479$$

To test whether the assigned discriminant function is in agreement with that derived from the data, the significance of the statistic

$$U = \frac{1 + N_1N_2D_p^2/(N_1 + N_2)(N_1 + N_2 - 2)}{1 + N_1N_2D_y^2/(N_1 + N_2)(N_1 + N_2 - 2)} - 1 = \frac{1 + 26.3295}{1 + 15.8030} - 1 = 0.6265$$

has to be tested. The value of the statistic

$$\frac{U(N_1 + N_2 - 1 - 4)}{4 - 1} = \frac{0.6265 \times 95}{3} = 19.84$$

as a variance ratio with 3 and 95 degrees of freedom is significant at the 1% level. This shows that the assigned function is not the best discriminant of the two species.
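
The comparison of the assigned function $x_1 - x_2 + x_3 + x_4$ with the best discriminant function can be sketched as follows. NumPy is assumed and the names are illustrative; $D_p^2$ is recomputed from the printed covariance matrix by a direct solve, so it may differ slightly from the printed 103.2119.

```python
import numpy as np

N1 = N2 = 50
p = 4
n = N1 + N2 - 2                                    # d.f. of the covariance matrix
d = np.array([0.930, -0.658, 2.798, 1.080])
W = np.array([[0.195340, 0.092200, 0.099626, 0.033055],
              [0.092200, 0.121079, 0.047175, 0.025251],
              [0.099626, 0.047175, 0.125488, 0.039586],
              [0.033055, 0.025251, 0.039586, 0.025106]])
a = np.array([1.0, -1.0, 1.0, 1.0])                # assigned coefficients

Dp2 = d @ np.linalg.solve(W, d)                    # D^2 of the best function
Dy2 = (a @ d) ** 2 / (a @ W @ a)                   # D^2 of the assigned function

c = N1 * N2 / ((N1 + N2) * n)
U = (1 + c * Dp2) / (1 + c * Dy2) - 1
F = U * (n - p + 1) / (p - 1)                      # (p - 1, n - p + 1) d.f.

print(f"V(y) = {a @ W @ a:.6f}, D_y^2 = {Dy2:.4f}")
print(f"U = {U:.4f}, F({p - 1}, {n - p + 1}) = {F:.2f}")
```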

In general, if the assigned discriminant function is

$$y = a_1x_1 + \cdots + a_px_p$$

then

$$D_y^2 = \frac{(\bar{y}_1 - \bar{y}_2)^2}{V(y)}$$

where $V(y) = \sum\sum a_ia_jw_{ij}$. If $w_{ij}$ are estimated on n degrees of freedom and the mean values are based on $N_1$ and $N_2$ observations for the two groups, then

$$U = \frac{1 + N_1N_2D_p^2/(N_1 + N_2)n}{1 + N_1N_2D_y^2/(N_1 + N_2)n} - 1$$

and

$$\frac{U(n - p + 1)}{p - 1}$$

can be used as a variance ratio with $(p - 1)$ and $(n - p + 1)$ degrees of freedom. This test is due to Fisher.


7b.5 Tests for Discriminant Function Coefficients 

If four samples of sizes $N_1$, $N_2$, $N_3$, and $N_4$ from populations A, B,
C, and E are available, we can test whether the discriminant functions
between A, B and C, E are significantly different by an extension of the
test criterion discussed above. It is a necessary condition of the test
that the variances and covariances are identical in the four populations
A, B, C, and E. No reasonably simple test can be constructed to establish
the equivalence of the discriminant functions when this condition
is not satisfied.

Let $(w_{ij})$ be the dispersion matrix based on $(N_1 + N_2 + N_3 + N_4 - 4)$
degrees of freedom, with $(w^{ij})$ denoting its inverse. If $d_1, \ldots, d_p$ are
the differences in mean values for A and B, and $d_1', \ldots, d_p'$ are those
for C and E, the test for equality of discriminant functions and the
associated distances is identical with the testing of the hypotheses

$$E(d_i) = E(d_i') \qquad i = 1, 2, \ldots, p$$

$$E(d_i) = E(-d_i') \qquad i = 1, 2, \ldots, p$$

The variance ratios with p and $n = (N_1 + N_2 + N_3 + N_4 - 3 - p)$
degrees of freedom for the two cases are

$$\frac{n\,f(N)}{p(n + p - 1)} \sum\sum w^{ij}(d_i - d_i')(d_j - d_j')$$

and

$$\frac{n\,f(N)}{p(n + p - 1)} \sum\sum w^{ij}(d_i + d_i')(d_j + d_j')$$

where

$$\frac{1}{f(N)} = \frac{1}{N_1} + \frac{1}{N_2} + \frac{1}{N_3} + \frac{1}{N_4}$$

The equality of discriminant functions is indicated if at least one of the
statistics is not significant. Similar tests can be constructed for judging
the differences in discriminant functions in parallel samples from two
populations or between A, B and A, C.

If the equality of discriminant function coefficients is to be tested
without considering the associated distance function, a suitable statistic
is

$$g(N) \sum\sum w^{ij}(d_i - \lambda d_i')(d_j - \lambda d_j') \qquad (7b.5.1)$$

where

$$\frac{1}{g(N)} = \frac{1}{N_1} + \frac{1}{N_2} + \lambda^2\left(\frac{1}{N_3} + \frac{1}{N_4}\right)$$

and $\lambda$ is chosen to minimize (7b.5.1). This minimum value may be used
as χ² with (p - 1) degrees of freedom when n is large.

Standard errors of discriminant function coefficients have been evalu- 
ated in an attempt to judge the significance of any single coefficient. 
There is some difficulty in this approach because discriminant function 
coefficients are not unique in the sense that they are the estimates of 
definite population parameters. What is unique is the ratio of any two 
coefficients, and an exact test is possible to test for an assigned ratio. 
For instance, if the ratio for the ith and jth characters is $\rho$, then we
have to test whether the distance based on the (p - 1) characters

$$x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p, \; x_i + \rho x_j$$

is the same as that based on all the p characters

$$x_1, \ldots, x_p$$

The statistic is

$$U = \frac{1 + \dfrac{N_1N_2}{(N_1 + N_2)(N_1 + N_2 - 2)}D_p^2}{1 + \dfrac{N_1N_2}{(N_1 + N_2)(N_1 + N_2 - 2)}D_{p-1}^2} - 1$$

and to test for its significance

$$\frac{N_1 + N_2 - p - 1}{1}\,U$$

can be used as a variance ratio with 1 and $(N_1 + N_2 - p - 1)$ degrees
of freedom.


7b.6 The Additional Information Supplied by Some Characters 


Table 7b.6α gives the mean values of femur and humerus lengths of
20 Indian and 27 Anglo-Indian skeletons.


TABLE 7b.6α. Mean Values of Femur and Humerus Lengths

                  Sample        Mean Length of
                  Size        Femur      Humerus
Anglo-Indians      27         460.4       335.1
Indians            20         444.3       323.2
Difference                     16.1        11.9


The pooled estimates (on 45 degrees of freedom) of standard deviations
are 23.7 and 18.2 and of correlation 0.8675. The $D^2$ based on the
femur alone is 0.4614, which yields a significant variance ratio 5.301
with 1 and 45 degrees of freedom. But the $D^2$ based on the two
characters, femur and humerus lengths, is 0.4777, leading to the variance
ratio 2.685 which is not significant on 2 and 44 degrees of freedom. Here
appears to be a dangerous situation where the inclusion of an extra
character is not beneficial in discriminating between two populations.
This leads us to the problem of studying the nature and number of 
characters which may be of use in discriminating between the groups. 
The first step in such a study is to develop a test to judge the significance 
of the additional distance contributed by the inclusion of some extra 
characters. The addition of such characters which do not increase the 
distance between the groups in the population will weaken the test. 

Even a small increase in distance will be helpful if the sample size is
large. For instance, in the above example, with 10 more observations
and an equal division of the sample size between the two groups, the
observed $D^2$ would have been significant.
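
The femur and humerus figures can be reproduced from the summary values alone. The following Python sketch assumes the usual two-variable form of D² in terms of standardized differences and the correlation, and the usual conversion of D² to a variance ratio; the function names are illustrative, not the author's.

```python
def d_squared_two(d1, d2, s1, s2, r):
    """D^2 for two characters from mean differences, pooled s.d.'s and correlation."""
    z1, z2 = d1 / s1, d2 / s2
    return (z1 * z1 - 2 * r * z1 * z2 + z2 * z2) / (1 - r * r)


def variance_ratio(d_sq, n1, n2, p):
    """Variance ratio for D^2 on p characters, with its degrees of freedom."""
    n = n1 + n2 - 2                          # d.f. of the pooled estimates
    t_sq = n1 * n2 / (n1 + n2) * d_sq
    return t_sq * (n - p + 1) / (p * n), p, n - p + 1


d_sq_femur = (16.1 / 23.7) ** 2
d_sq_both = d_squared_two(16.1, 11.9, 23.7, 18.2, 0.8675)

f_femur = variance_ratio(d_sq_femur, 27, 20, 1)
f_both = variance_ratio(d_sq_both, 27, 20, 2)
print(round(d_sq_femur, 4), f_femur)
print(round(d_sq_both, 4), f_both)
```

Because the inputs are rounded summary values, the results agree with the printed 0.4614, 0.4777, 5.301, and 2.685 only to about three figures.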

Two problems then arise: first to test whether the inclusion of some 
extra characters increases the distance in the population, and second 
to estimate the additional distance and determine for what sample size 
this addition is useful. There is yet another practical issue which is
relevant in problems of the next chapter where a number of measurements
are obtained for assigning an individual to one of two groups. The
error committed in such a classification depends on the distance be- 
tween the two groups, and an extra character added may increase the 
distance only by a trifle, in which case it may not be worth while to 
measure an extra character. 

To solve the first problem let p be the number of basic characters to
which are added q more characters. Let samples of sizes $N_1$ and $N_2$
be available for the two groups containing measurements on all the
p + q characters. If $D_{p+q}^2$ is of the same order as $D_p^2$, then the ratio

$$R = \frac{1 + \dfrac{N_1N_2}{(N_1 + N_2)(N_1 + N_2 - 2)}D_{p+q}^2}{1 + \dfrac{N_1N_2}{(N_1 + N_2)(N_1 + N_2 - 2)}D_p^2}$$

is about unity. A high value of this ratio would indicate that $D_{p+q}^2$
is significantly greater than $D_p^2$ so that the q characters supply some
additional information. The actual test is to use

$$\frac{N_1 + N_2 - p - q - 1}{q}\,U_{q,p}$$

where $U_{q,p} = (R - 1)$, as a variance ratio with q and
$(N_1 + N_2 - p - q - 1)$ degrees of freedom.

In the example of Iris versicolor and Iris setosa we might ask the
question whether sepal and petal lengths alone are sufficient for
discrimination. In other words, does the inclusion of widths increase the
distance? For this we need the value of $D^2$ based on the lengths only.
It is useful to obtain the corresponding discriminant function also. The
successive evaluation of $D^2$'s and discriminant functions can be carried
out as illustrated in Table 7b.6β. This is essentially a method of pivotal
condensation developed by Aitken (1933) but slightly modified to effect
economy in entries.

The $D_2^2$ corresponding to lengths only is 76.7082, and $D_4^2 = 103.2119$,
so that

$$U_{q,p} = \frac{1 + 50 \times 50(103.2119)/(100 \times 98)}{1 + 50 \times 50(76.7082)/(100 \times 98)} - 1 = \frac{1 + 26.3295}{1 + 19.5684} - 1 = 0.3287$$


TABLE 7b.6β. Pivotal Condensation Method for Obtaining Successive D² Values and Discriminant Functions L(x)

[Table body and footnotes are not legible in the source.]


The variance ratio with 2 and $(N_1 + N_2 - p - q - 1) = 95$ degrees
of freedom is

$$\frac{95}{2}(0.3287) = 15.6132$$

which is significant, showing that widths are useful in addition to the
lengths. The additional distance in such a case is determined by

$$D_{p+q}^2 - D_p^2 = 103.2119 - 76.7082 = 26.5037$$
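
The test for additional information reduces to a few lines of arithmetic. The Python sketch below retraces the Iris figures; the function name `additional_information_test` and its argument names are illustrative assumptions.

```python
def additional_information_test(d_sq_p, d_sq_pq, n1, n2, p, q):
    """Return (U, variance ratio, df1, df2) for q characters added to p basic ones."""
    c = n1 * n2 / ((n1 + n2) * (n1 + n2 - 2))
    u = (1 + c * d_sq_pq) / (1 + c * d_sq_p) - 1      # U = R - 1
    df2 = n1 + n2 - p - q - 1
    return u, u * df2 / q, q, df2


# Lengths only (p = 2) against lengths and widths (p + q = 4), 50 plants per species
u, f, df1, df2 = additional_information_test(76.7082, 103.2119, 50, 50, 2, 2)
print(round(u, 4), round(f, 2), df1, df2)
```

Carrying full precision gives a ratio near 15.61, in agreement with the printed 15.6132, which was obtained from the rounded value 0.3287.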




A question may be asked as to why the difference $D_{p+q}^2 - D_p^2$ could
not be tested directly. The distribution of this difference involves the
population value of the distance based on the first p characters, and
unless this is known no exact test of significance can be made. On the
other hand the statistic $U_{q,p}$, which also gives a comparison of the two
$D^2$ values, is distributed in a simple manner, and there is not the problem
of any nuisance parameter (5c.3). If the samples are large and the
population value $\Delta_p^2$ of $D_p^2$ is not large, the distribution of the
difference $D_{p+q}^2 - D_p^2$ is independent of $\Delta_p^2$ to a large extent.
In such a case

$$\frac{N_1 + N_2 - p - 1}{N_1 + N_2 - 1} \cdot \frac{N_1 + N_2 - p - q - 1}{q}\,W_{q,p} \qquad (7b.6.1)$$

can be used approximately as a variance ratio with q and
$(N_1 + N_2 - p - q - 1)$ degrees of freedom, where

$$W_{q,p} = \frac{N_1N_2}{(N_1 + N_2)(N_1 + N_2 - 2)}(D_{p+q}^2 - D_p^2)$$

In the above example,

$$W_{q,p} = 26.3295 - 19.5684 = 6.7611$$

$$\frac{97 \times 95}{99 \times 2}(6.7611) = 47.5(6.6245) = 314.6637$$

which is a very high variance ratio compared to that obtained on the
basis of $U_{q,p}$. The approximation here is very crude (it always
overestimates significance), especially because $D_p^2$ happens to be very high.
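
For comparison with the exact test, the approximate ratio (7b.6.1) can be evaluated in the same way. This is a sketch under the reading of (7b.6.1) given above; the function name is illustrative.

```python
def w_approximation(d_sq_p, d_sq_pq, n1, n2, p, q):
    """Return (W, approximate variance ratio, df1, df2) following (7b.6.1)."""
    total = n1 + n2
    w = n1 * n2 / (total * (total - 2)) * (d_sq_pq - d_sq_p)
    df2 = total - p - q - 1
    ratio = (total - p - 1) / (total - 1) * df2 / q * w
    return w, ratio, q, df2


w, ratio, df1, df2 = w_approximation(76.7082, 103.2119, 50, 50, 2, 2)
print(round(w, 4), round(ratio, 2), df1, df2)
```

The result, near 314.7 on 2 and 95 degrees of freedom, is about twenty times the exact ratio 15.61, which illustrates how crude the approximation is when $D_p^2$ is large.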

Cochran and Bliss (1948) considered a situation where initial
intelligence quotients (I.Q.) can be used as concomitant variables in studying
the differences introduced by two types of training. For this it is
suggested that a sample may be divided at random into two groups, each
of which is required to take a different training. This means that with
respect to initial I.Q. values the two groups can be regarded as having
come from the same population so that an exact test based on $W_{q,p}$ is
possible. The exact distribution of $W_{q,p}$, even in this case, is a bit
complicated, but a good approximation to this is the variance ratio
considered above (7b.6.1).

It is interesting to observe that $U_{q,p}$ in any case is distributed
independently of the first p variables. The only condition needed for the
test is that, given p measurements, the expected value of any other
measurement is a linear combination of the first p. If this is so, it is
not necessary that the first p variables be observed at random. If the
problem is to test whether q additional characters discriminate between
the two populations independently of a basic set, it might be profitable
to select samples from the two populations such that they agree on the
average, as far as possible, in the basic set of p characters. In the
problem of Cochran and Bliss the initial I.Q. values may be used to effect
such a division. The only test available in such a case is that based on
$U_{q,p}$. If the sample is divided at random into two groups, the test
based on $W_{q,p}$ is theoretically more accurate.

The following conclusions will be useful in studying problems of
this nature.

1. Whatever may be the number of characters chosen to discriminate 
between two populations, it is profitable to divide the samples equally 
between the two populations. 

2. Unless the samples are large, it is not profitable to consider nearly 
related measurements in tests for discrimination. 

3. To judge the significance of increase in distance due to the
inclusion of q extra characters to a basic set of p, it is advantageous (from
the point of view of the test) to choose the individuals from the two
populations such that they may agree, as far as possible, in the average
measurements of the basic set.

In 3f.3 some comments were made as to the applicability of a pre- 
diction formula found from one series for an individual of an unrelated 
series. Thus a question might be asked whether a prediction formula 
for cranial capacity deduced from the measurements on an Anglo- 
Saxon skull can be used to predict the capacity of an Indian skull, 
providing measurements on its length, breadth, and height. This is 
a very important problem because, in the absence of suitable data, it 
may be necessary to use the formula derived from a different series. 
As observed earlier, one condition may be that the internal relation- 
ships of the measurements should be the same for the two series, which 
means that the variances and covariances of the measurements should 
be the same. This requirement is very often satisfied with biometric 
material; the differences, when they exist, are quite small. This is not 
enough. It must be known that the two series are such that the whole 
distance between them should be capable of being explained by the 
differences in the characters used for prediction. In other words, if
$x_1, x_2, \ldots, x_p$ are the characters used for the prediction of a character y,
then these measurements must be such that the additional distance due
to y on eliminating $x_1, \ldots, x_p$ is theoretically zero. Only in such cases
can a single formula be applicable for two series. This can be verified 
by the methods developed here. In fact the methods are applicable 
in a more general case requiring the prediction of more than one char- 
acter with the help of a basic set. 


7c Generalization of D² and the Large Sample
Theory for Several Groups


Let there be k multivariate populations $A_1, A_2, \ldots, A_k$ from which
samples of sizes $N_1, N_2, \ldots, N_k$ are available for (p + q) characters.
The common covariance matrix assumed to be known or estimated on
a large number of degrees of freedom is represented by $(\alpha_{ij})_p$ for the
first p characters and by $(\alpha_{ij})_{p+q}$ for all the (p + q) characters. The
inverse of $(\alpha_{ij})_p$ is represented by $(\alpha_p^{ij})$, and that of $(\alpha_{ij})_{p+q}$ by $(\alpha_{p+q}^{ij})$.
Let $\bar{x}_{i1}, \bar{x}_{i2}, \ldots$ be the mean values of the ith character in the first,
second, etc., populations.

It is shown in 2c.3 that the statistic

$$V_{pk} = \sum_{i,j=1}^{p} \alpha_p^{ij} \sum_{r=1}^{k} N_r(\bar{x}_{ir} - \bar{x}_i)(\bar{x}_{jr} - \bar{x}_j)$$

where $\bar{x}_i = (\sum N_r\bar{x}_{ir})/(\sum N_r)$, can be used as χ² with p(k - 1) degrees
of freedom to test the hypothesis that the mean values are the same in
all the k populations for these p characters. The statistic V is a suitable
generalization of Mahalanobis' D² in its classical form.

When this test indicates differences in mean values it is in some
problems necessary to test whether the observations on q additional
characters supply independent information for discrimination. The
statistic for testing the differences in means for all the p + q characters is

$$V_{(p+q)k} = \sum_{i,j=1}^{p+q} \alpha_{p+q}^{ij} \sum_{r=1}^{k} N_r(\bar{x}_{ir} - \bar{x}_i)(\bar{x}_{jr} - \bar{x}_j)$$

which can be used as χ² with (p + q)(k - 1) degrees of freedom. The
q(k - 1) additional degrees of freedom bring in the contribution

$$V_{(p+q)k} - V_{pk}$$

and the significance of this difference can be appropriately used to judge
the significance of the information supplied by the additional characters.
This difference can be used as χ² with q(k - 1) degrees of freedom, as
shown below.


The hypothesis that the new characters do not lead to further
discrimination of the populations specifies that any linear function of the
(p + q) characters uncorrelated with each of the p characters has the
same mean value for all the k populations. There are q such linear
functions and, if they are treated as q variables, a χ² with q(k - 1)
degrees of freedom can be constructed to test the above hypothesis.
The above method of taking the difference is only an alternative way
of calculating this χ²; for, $V_{(p+q)k}$ calculated from all the (p + q)
characters, being invariant under linear transformations of the variables,
is equal to $V_{pk}$ + χ² calculated from the p original characters and the
q linear functions chosen to be uncorrelated with each of the p characters.

In the above derivation it has been assumed that the variances and 
covariances are known, and the distributions are asymptotically true 
when they are estimated on a large number of degrees of freedom. When 
more than two populations are involved the pooled estimates of the 
covariances usually have a sufficiently large number of degrees of free- 
dom to validate the use of the asymptotic distributions. More exact 
tests for cases involving small numbers of degrees of freedom are given 
in the next section. 
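
The equivalence between the difference $V_{(p+q)k} - V_{pk}$ and the χ² built from a linear function uncorrelated with the basic characters can be checked numerically. The Python sketch below uses hypothetical group sizes, means, and covariance matrix (p = 1, q = 1, k = 3); none of these numbers come from the text, and the helper names are illustrative.

```python
def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def v_statistic(cov, group_means, sizes):
    """Sum over i, j of a^{ij} * sum_r N_r (xbar_ir - xbar_i)(xbar_jr - xbar_j)."""
    p = len(cov)
    inv = inverse(cov)
    total = sum(sizes)
    grand = [sum(n * m[i] for n, m in zip(sizes, group_means)) / total
             for i in range(p)]
    v = 0.0
    for i in range(p):
        for j in range(p):
            b_ij = sum(n * (m[i] - grand[i]) * (m[j] - grand[j])
                       for n, m in zip(sizes, group_means))
            v += inv[i][j] * b_ij
    return v


sizes = [20, 30, 25]                               # hypothetical N_r
means = [[10.0, 5.0], [12.0, 7.0], [11.0, 5.0]]    # hypothetical means of (x1, x2)
cov = [[4.0, 2.0], [2.0, 3.0]]                     # hypothetical known covariance matrix

v_p = v_statistic([[cov[0][0]]], [[m[0]] for m in means], sizes)   # basic character x1
v_pq = v_statistic(cov, means, sizes)                              # x1 and x2 together
extra = v_pq - v_p                                 # chi-square on q(k - 1) = 2 d.f.

# z = x2 - (2/4) x1 is uncorrelated with x1 and has variance 3 - 2 * 2 / 4 = 2
z_means = [[m[1] - 0.5 * m[0]] for m in means]
v_z = v_statistic([[2.0]], z_means, sizes)
print(v_p, v_pq, extra, v_z)
```

The last two printed values coincide, which is the invariance argument of the text in numerical form.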


7d Tests with Wilks’s Λ Criterion


7d.1 Analysis of Dispersion and the Theoretical Aspects of
the Λ Criterion


In the univariate analysis of variance, tests of significance reduce to 
the comparison of two independently distributed mean squares. One 
of the mean squares is an unbiased estimate of the variance to which 
a single observation in any particular class is subject and is called the 
error variance. The other is an unbiased estimate only when the null 
hypothesis which is being tested is correct and may be called the mean 
square due to deviation from the hypothesis. The test depends only 
on the individual degrees of freedom of two mean squares. 

When each sample supplies p mutually correlated variables, there are 
p total sums of squares and p(p — 1)/2 total sums of products which 
can be analyzed into various categories. This process, which involves 
the technique of analyzing the variances and covariances of multiple 
correlated variables, may be termed the analysis of dispersion. The 
term dispersion was originally used by P. C. Mahalanobis to indicate 
the scatter of a set of observations as measured by the variances and 
covariances. Following this terminology, the total dispersion may be 
said to be analyzed into dispersion due to various categories. 

If we represent the total sums of products by the matrix $S = (S_{ij})$,
then the analysis of dispersion consists in analyzing each element such
as $S_{ij}$, according to the usual procedure, into various categories with
the corresponding distribution of degrees of freedom. The dispersion
due to any category supplies the sum of products (S.P.) matrix which
on division by the degrees of freedom gives the mean product (M.P.)
matrix. The S.P. matrix leading to unbiased estimates of the variances
and covariances to which a single set of variables is subject is called the
S.P. matrix due to error. This error matrix may be denoted by W
with w as its degrees of freedom. In the analysis of dispersion the S.P.
matrix due to any other category leads to unbiased estimates of
variances and covariances only when the null hypothesis regarding that
category is true. This may be called the S.P. matrix due to deviation
from the hypothesis. If such a matrix is represented by Q with q as
its degrees of freedom, then the problem of testing the null hypothesis
consists in comparing the matrices (1/w)W and (1/q)Q. The
simultaneous comparison of the estimates of the variances and covariances
appears to be a natural extension of the comparison of variances in the
case of a single variate.

The appropriate test criteria for comparison may be obtained by
extending the method of discriminant function analysis. A linear
compound of the variables is taken, and the compounding coefficients are
chosen such that the ratio of mean squares due to deviation from
hypothesis and due to error for this variable is a maximum. The ratio f,
which comes out as a root of the determinantal equation

$$\left|\frac{1}{q}Q - \frac{f}{w}W\right| = 0$$

may be used as the appropriate test criterion. If $|W| \neq 0$, the number
of non-zero roots of this equation is equal to the number of variables
under consideration or q, the number of degrees of freedom of Q,
whichever is smaller. An adequate comparison of (1/q)Q and (1/w)W must
involve the tests of significance of all the roots. If $f_1, f_2, \ldots$ represent
the various roots, it is easy to verify that

$$\frac{|W + Q|}{|W|} = \left(1 + \frac{q}{w}f_1\right)\left(1 + \frac{q}{w}f_2\right)\cdots$$

The ratio |W|/|W + Q|, denoted by Λ, decreases as the magnitude
of the roots increases, and a significantly small value of Λ may be taken
as indicating the significance of one or more of the roots. This is the
basis of the Λ criterion arrived at by Wilks (1932) by
the likelihood ratio method and later extended by Bartlett (1934)
for use in multivariate analysis.

This criterion does not provide a satisfactory test, for when only one
or a smaller number of roots than the total indicate real differences, their
significance may be obscured by the use of the overall test. Its use can
be recommended only in situations where small deviations from the
hypothesis can be ignored.
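
The relation between Λ and the roots of the determinantal equation can be verified on a small case. The Python sketch below treats two variables with hypothetical S.P. matrices W and Q and hypothetical degrees of freedom; for 2 x 2 matrices the roots follow from a quadratic, so no matrix library is needed. All names and numbers are illustrative.

```python
import math


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def lambda_and_roots(w_mat, q_mat, w_df, q_df):
    """Return Lambda = |W| / |W + Q| and the roots f of |Q/q - f W/w| = 0 (2 x 2 case)."""
    total = [[w_mat[i][j] + q_mat[i][j] for j in range(2)] for i in range(2)]
    lam = det2(w_mat) / det2(total)
    # |Q - theta W| = 0 expands to a*theta^2 - b*theta + c = 0 for symmetric matrices
    a = det2(w_mat)
    b = (w_mat[0][0] * q_mat[1][1] + w_mat[1][1] * q_mat[0][0]
         - 2 * w_mat[0][1] * q_mat[0][1])
    c = det2(q_mat)
    disc = math.sqrt(b * b - 4 * a * c)
    thetas = [(b + disc) / (2 * a), (b - disc) / (2 * a)]
    f_roots = [theta * w_df / q_df for theta in thetas]   # since theta = q f / w
    return lam, f_roots


w_mat = [[10.0, 2.0], [2.0, 8.0]]    # hypothetical error S.P. matrix, 20 d.f.
q_mat = [[4.0, 1.0], [1.0, 3.0]]     # hypothetical hypothesis S.P. matrix, 3 d.f.
lam, f_roots = lambda_and_roots(w_mat, q_mat, 20, 3)

product = 1.0
for f in f_roots:
    product *= 1 + 3 / 20 * f
print(lam, f_roots, 1 / product)
```

The first and last printed values agree, showing that a small Λ corresponds to large roots taken together.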


7d.2 The Distribution of Λ and Its Practical Use
The following notations will be used throughout this and the
subsequent sections.


TABLE 7d.2α. Analysis of Dispersion for p Variables

Due to                               D.F.       S.P. Matrix
(1) Deviation from hypothesis        q          Q
(2) Error                            n - q      W
(3) Total                            n          W + Q

$$\Lambda = \frac{|W|}{|W + Q|}$$

If the number of variables involved is p, then, assuming that the
elements of W are distributed independently of those of Q, it is easy to
derive the tth moment of Λ:

$$E(\Lambda^t) = \prod_{i=0}^{p-1} \frac{\Gamma\{\tfrac{1}{2}(n - i)\}\,\Gamma\{\tfrac{1}{2}(n - q - i) + t\}}{\Gamma\{\tfrac{1}{2}(n - q - i)\}\,\Gamma\{\tfrac{1}{2}(n - i) + t\}}$$

The tests based on the exact distributions given by Wilks (1932) and
Nair (1939) for some particular cases obtained by a comparison of
moments are reproduced below.

Nature of the Test       Variance Ratio                      Degrees of Freedom
q = 1, for any p         [(1 - Λ)/Λ] (n - p)/p               p and (n - p)
q = 2, for any p         [(1 - √Λ)/√Λ] (n - p - 1)/p         2p and 2(n - p - 1)
p = 1, for any q         [(1 - Λ)/Λ] (n - q)/q               q and (n - q)
p = 2, for any q         [(1 - √Λ)/√Λ] (n - q - 1)/q         2q and 2(n - q - 1)
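
The four exact cases can be collected into one small function. This is a Python sketch of the table above, with n denoting the total degrees of freedom as in Table 7d.2α; the function name and the decision to raise an error outside the tabulated cases are choices made here, not part of the text.

```python
import math


def exact_variance_ratio(lam, n, p, q):
    """Return (variance ratio, df1, df2) where the distribution of Lambda is exact."""
    if q == 1:
        return (1 - lam) / lam * (n - p) / p, p, n - p
    if p == 1:
        return (1 - lam) / lam * (n - q) / q, q, n - q
    root = math.sqrt(lam)
    if q == 2:
        return (1 - root) / root * (n - p - 1) / p, 2 * p, 2 * (n - p - 1)
    if p == 2:
        return (1 - root) / root * (n - q - 1) / q, 2 * q, 2 * (n - q - 1)
    raise ValueError("no exact variance ratio is tabulated for these p and q")


print(exact_variance_ratio(0.5, 20, 3, 1))
print(exact_variance_ratio(0.25, 20, 2, 4))
```

When p = 1 and q = 1 both of the first two rows apply and give the same ratio, so the order of the checks does not matter.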


For other values of p and q, the exact values of the probabilities can
be obtained by the use of χ² tables alone. Defining

$$V = -m\log_e\Lambda = -\left(n - \frac{p + q + 1}{2}\right)\log_e\Lambda$$

the distribution function of V can be obtained in the asymptotic form

$$P_{pq} + \frac{\gamma_2}{m^2}(P_{pq+4} - P_{pq}) + \frac{1}{m^4}\{\gamma_4(P_{pq+8} - P_{pq}) - \gamma_2^2(P_{pq+4} - P_{pq})\} + \cdots$$

where $P_{pq+r}$ is the distribution function of χ² with (pq + r) degrees
of freedom. If m is large, the first approximation consists in using V
as χ² with pq degrees of freedom. For obtaining the second and third
approximations the expressions for $\gamma_2$ and $\gamma_4$ are

$$\gamma_2 = \frac{pq}{48}(p^2 + q^2 - 5)$$

$$\gamma_4 = \frac{\gamma_2^2}{2} + \frac{pq}{1920}\{3p^4 + 3q^4 + 10p^2q^2 - 50(p^2 + q^2) + 159\}$$

In many practical problems the first approximation suggested by
Bartlett can be used.
Defining the statistic

$$y = \Lambda^{1/s}, \qquad s = \sqrt{\frac{p^2q^2 - 4}{p^2 + q^2 - 5}}$$

the distribution function of y can be written as a series whose first
term is

$$B\left(\frac{ms}{2} + \lambda, \; r\right)$$

where

$$r = \frac{pq}{2}, \qquad \lambda = -\frac{pq - 2}{4}$$

4 


and B(t, u) is the distribution function of the beta variable. The first
term above offers a powerful approximation, the second term being
$O(1/m^4)$. In this case the statistic

$$\frac{1 - \Lambda^{1/s}}{\Lambda^{1/s}} \cdot \frac{ms + 2\lambda}{2r}$$

can be used as a variance ratio with 2r and (ms + 2λ) degrees of freedom.
The quantity (ms + 2λ) need not be integral.


7d.3 Test of Differences in Mean Values for Several Populations

Let $A_1, A_2, \ldots, A_k$ be k populations from which samples of sizes
$N_1, \ldots, N_k$ for p correlated variables are available. The dispersion
has to be analyzed into between and within populations. The S.P.
matrix due to within populations (or the error) has $(N_1 + \cdots + N_k - k)$
degrees of freedom, and that due to between populations has (k - 1)
degrees of freedom. If these are represented by W and Q, then the
statistic to be used for testing the differences in mean values is

$$V = -m\log_e\Lambda, \qquad \Lambda = \frac{|W|}{|W + Q|}$$

where

$$m = n - \frac{p + q + 1}{2}, \qquad n = N_1 + \cdots + N_k - 1, \qquad q = k - 1$$

The exact probability of V's exceeding the observed value can be
calculated as explained in 7d.2.

Example 1. Table 7d.3α gives the analysis of dispersion for the three
characters, head length ($x_1$), height ($x_2$), and weight ($x_3$), measured on
140 schoolboys, of almost the same age, belonging to six different schools
in an Indian city.


TABLE 7d.3α. Analysis of Dispersion

Dispersion                                          S.P. Matrix
Due to            D.F.               x1²        x2²        x3²       x1x2      x1x3      x2x3
Between schools     5    (Q_ij)      752.0      151.3    1,612.7     214.2     521.3     401.2
Within schools    134    (W_ij)   12,809.3    1,499.6   21,009.6   1,003.7   2,671.2   4,123.6
Total             139    (S_ij)   13,561.3    1,650.9   22,622.3   1,217.9   3,192.5   4,524.8

$$\Lambda = \frac{|W|}{|S|} = \frac{\begin{vmatrix} 12{,}809.3 & 1{,}003.7 & 2{,}671.2 \\ 1{,}003.7 & 1{,}499.6 & 4{,}123.6 \\ 2{,}671.2 & 4{,}123.6 & 21{,}009.6 \end{vmatrix}}{\begin{vmatrix} 13{,}561.3 & 1{,}217.9 & 3{,}192.5 \\ 1{,}217.9 & 1{,}650.9 & 4{,}524.8 \\ 3{,}192.5 & 4{,}524.8 & 22{,}622.3 \end{vmatrix}} = \frac{10^{12}(0.176005)}{10^{12}(0.213628)}$$

$$-\log_e\Lambda = 0.193724$$

$$m = 139 - \tfrac{1}{2}(5 + 3 + 1) = 134.5$$

$$V = -m\log_e\Lambda = (134.5)(0.193724) = 26.0559$$

Using V as χ² with pq = 15 degrees of freedom, the first approximation
comes out as $P_{15} = 0.0375$.

The second term is

$$\frac{\gamma_2}{m^2}(P_{19} - P_{15})$$

where

$$\gamma_2 = \frac{29 \times 15}{48} \quad \text{and} \quad \frac{\gamma_2}{m^2} = \frac{29 \times 15}{48(134.5)^2} = 0.00050096$$

$$\frac{\gamma_2}{m^2}(P_{19} - P_{15}) = 0.00050096(0.1285 - 0.0375) = 0.00004574$$
m 


This correction to the first approximation affects only the fourth 
decimal place so that correction is hardly necessary. The observed value 
of V is significant at the 5% level, showing thereby that boys of various 
schools differ in physique. This appears to be generally true since boys 
belonging to different social strata attend different schools. 
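
The computation in this example can be retraced from Table 7d.3α. The Python sketch below evaluates the two determinants, Λ, m, V, and the coefficient of the second term; the tail probabilities 0.0375 and 0.1285 are taken from the text rather than computed, and the variable names are illustrative.

```python
import math


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# S.P. matrices from Table 7d.3a (within schools and total)
within = [[12809.3, 1003.7, 2671.2],
          [1003.7, 1499.6, 4123.6],
          [2671.2, 4123.6, 21009.6]]
total = [[13561.3, 1217.9, 3192.5],
         [1217.9, 1650.9, 4524.8],
         [3192.5, 4524.8, 22622.3]]

n, p, q = 139, 3, 5
lam = det3(within) / det3(total)
m = n - (p + q + 1) / 2
v = -m * math.log(lam)

gamma2 = p * q * (p * p + q * q - 5) / 48
scale = gamma2 / m ** 2                      # coefficient of the second term
correction = scale * (0.1285 - 0.0375)       # probabilities as quoted in the text
print(lam, m, v, scale, correction)
```

With the rounded probabilities the correction comes out near 0.0000456, slightly below the printed 0.00004574, which presumably used unrounded values; either way only the fourth decimal place of the probability is affected.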

To use the variance ratio approximation we find 2r = 15, s = 2.67,
ms + 2λ = 352.61. The variance ratio 1.77 with 15 and 352.61 degrees
of freedom is significant at the 5% level.
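
The variance ratio approximation can be evaluated as follows. This Python sketch assumes that s is defined as the square root of (p²q² - 4)/(p² + q² - 5), with r = pq/2 and λ = -(pq - 2)/4; that definition of s is the commonly quoted form and is an assumption here, and the function name is illustrative.

```python
import math


def rao_variance_ratio(lam, n, p, q):
    """Return (variance ratio, df1, df2, s) for the beta approximation to Lambda."""
    m = n - (p + q + 1) / 2
    # Assumed definition of s; undefined when p*p + q*q == 5
    s = math.sqrt((p * p * q * q - 4) / (p * p + q * q - 5))
    r = p * q / 2
    lam_shift = -(p * q - 2) / 4             # the quantity written lambda in the text
    df1, df2 = 2 * r, m * s + 2 * lam_shift
    y = lam ** (1 / s)
    return (1 - y) / y * df2 / df1, df1, df2, s


f, df1, df2, s = rao_variance_ratio(0.176005 / 0.213628, 139, 3, 5)
print(round(f, 2), df1, round(df2, 2), round(s, 4))
```

Under this assumed definition s comes out near 2.76 rather than the printed 2.67, and the second degrees of freedom near 364.8 rather than 352.61; the variance ratio still rounds to 1.77, so the conclusion of the example is unchanged.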


7d.4 Internal Analysis of a Set of Variates 


Let $x_1, \ldots, x_s, x_{s+1}, \ldots, x_{s+p}$ be (s + p) correlated variables for
which samples of sizes $N_1, \ldots, N_k$ are available from k populations.
If the differences in mean values of these (s + p) variables are to be
tested for significance, then the method given in 7d.3 can be used. An
important problem that arises in biometry is to test whether the
variables, say $x_{s+1}, \ldots, x_{s+p}$, bring out further differences in populations
when the differences due to $x_1, \ldots, x_s$ are removed.

It is apparent in problems of this nature that some of the variables
in the set $x_1, \ldots, x_s$ might be in the nature of concomitant variables
which have been observed in association with the dependent variables
or which might have been chosen to have some specified values. An
illustration of such an analysis is found in a problem where three
dependent variables g, h, and i, corresponding to linear, parabolic, and cubic
terms of growth curves of pig weights, are considered together with a
concomitant variable w giving the initial weight of pigs. It was desired
to test whether the variables h and i bring out further differences in
food treatments when the differences due to g and w are eliminated.
The problem is identical with that posed above, with g, w forming the
first set and h, i the second set of variables.

There is a third set of problems in which it is desired to test whether
the differences in k groups characterized by (s + p) measurements can
be explained by variations in s assigned linear functions of these
measurements. If $y_1, \ldots, y_{s+p}$ are the (s + p) variables and

$$L_1 = m_{11}y_1 + \cdots + m_{1,p+s}y_{p+s}$$

$$\cdots$$

$$L_s = m_{s1}y_1 + \cdots + m_{s,p+s}y_{p+s}$$

are the assigned linear functions, then we can replace the (s + p)
variables $y_1, \ldots, y_{s+p}$ by $x_1, \ldots, x_{s+p}$, defined by

$$x_1 = L_1, \; \ldots, \; x_s = L_s$$

$$x_{s+1} = m_{s+1,1}y_1 + \cdots + m_{s+1,s+p}y_{p+s}$$

$$\cdots$$

$$x_{s+p} = m_{s+p,1}y_1 + \cdots + m_{s+p,s+p}y_{p+s}$$

where the coefficients in $x_{s+1}, \ldots, x_{s+p}$ are chosen arbitrarily subject
to the condition that the determinant $|m_{ij}|$ (i, j = 1, 2, ..., s + p)
is not zero. This latter condition ensures that the transformation from
the y to the x leads to one-to-one correspondence. Once again, the
problem is reduced to that of considering the differences in $x_{s+1}, \ldots, x_{s+p}$
when those due to $x_1, \ldots, x_s$ are removed. The proposed test
is independent of the compounding coefficients used to define the set
$x_{s+1}, \ldots, x_{s+p}$ so that, in any practical problem, they may be
conveniently or conventionally chosen.
In all these cases, the problem is one of analyzing the dispersion of
the variables $x_{s+1}, \ldots, x_{s+p}$ when the dispersion due to $x_1, \ldots, x_s$ is
removed. This can be done by following the covariance technique
suitable for p dependent variables and s independent variables as indicated
in 8c.2.

Let

$$S_{ij} = Q_{ij} + W_{ij} \qquad i, j = 1, 2, \ldots, (s + p)$$

be the analysis of dispersion for all the (s + p) variates due to deviation
from hypothesis and error with the corresponding distribution of degrees
of freedom as

$$n' = q + (n' - q)$$

The S.P. matrix due to error for the variables $x_1, \ldots, x_s$ to be eliminated
is

$$\begin{pmatrix} W_{11} & \cdots & W_{1s} \\ \vdots & & \vdots \\ W_{s1} & \cdots & W_{ss} \end{pmatrix}$$

and its inverse is represented by

$$\begin{pmatrix} W^{11} & \cdots & W^{1s} \\ \vdots & & \vdots \\ W^{s1} & \cdots & W^{ss} \end{pmatrix}$$

The S.P. matrix due to error for $x_{s+1}, \ldots, x_{s+p}$, when corrected for
$x_1, \ldots, x_s$, is given by W(s + 1, ..., s + p | 1, ..., s) or simply
W(p | s), where

$$W(p \mid s) = \begin{pmatrix} W_{s+1,s+1} & \cdots & W_{s+1,s+p} \\ \vdots & & \vdots \\ W_{s+p,s+1} & \cdots & W_{s+p,s+p} \end{pmatrix} - \begin{pmatrix} W_{s+1,1} & \cdots & W_{s+1,s} \\ \vdots & & \vdots \\ W_{s+p,1} & \cdots & W_{s+p,s} \end{pmatrix} \begin{pmatrix} W^{11} & \cdots & W^{1s} \\ \vdots & & \vdots \\ W^{s1} & \cdots & W^{ss} \end{pmatrix} \begin{pmatrix} W_{1,s+1} & \cdots & W_{1,s+p} \\ \vdots & & \vdots \\ W_{s,s+1} & \cdots & W_{s,s+p} \end{pmatrix}$$

This form, which involves the evaluation of a triple product of matrices,
appears to be convenient for computation as illustrated in the next
section. Another way of obtaining this matrix W(p | s) is to start with
the complete matrix $(W_{ij})$, (i, j = 1, 2, ..., s, s + 1, ..., s + p) and
reduce it s times by the method of pivotal condensation starting from
the element $W_{11}$. Replacing W by S, we have the formula for computing
the S.P. matrix due to "deviation from hypothesis + error" for
$x_{s+1}, \ldots, x_{s+p}$ when corrected for $x_1, \ldots, x_s$. If this is represented by
S(p | s), then the required criterion is

$$\Lambda = \frac{|W(p \mid s)|}{|S(p \mid s)|}$$

The degrees of freedom for W(p | s) are (n' - q - s), and those for
S(p | s) are (n' - s), so that in standard notation the parameters
associated with Λ are

$$n = n' - s \qquad p = p \qquad q = q$$

The test can be carried out as discussed in 7d.2.
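
The second route to W(p | s), reducing the complete matrix s times by pivotal condensation, can be sketched in Python as follows. The matrices below are hypothetical (s = 1, p = 2), the function names are illustrative, and no pivoting is attempted because S.P. matrices due to error are assumed positive definite.

```python
def condense(matrix, s):
    """Sweep out the first s variables; return the p x p matrix corrected for them."""
    a = [list(map(float, row)) for row in matrix]
    size = len(a)
    for k in range(s):                       # pivot on the kth diagonal element
        pivot = a[k][k]
        for i in range(k + 1, size):
            factor = a[i][k] / pivot
            for j in range(k + 1, size):
                a[i][j] -= factor * a[k][j]
    return [row[s:] for row in a[s:]]


def det(matrix):
    """Determinant by reduction to triangular form (no pivoting)."""
    a = [list(map(float, row)) for row in matrix]
    size, value = len(a), 1.0
    for k in range(size):
        value *= a[k][k]
        for i in range(k + 1, size):
            factor = a[i][k] / a[k][k]
            for j in range(k + 1, size):
                a[i][j] -= factor * a[k][j]
    return value


w_full = [[10, 2, 1], [2, 8, 3], [1, 3, 6]]        # hypothetical error S.P. matrix
q_full = [[4, 1, 2], [1, 3, 1], [2, 1, 5]]         # hypothetical hypothesis S.P. matrix
s_full = [[w_full[i][j] + q_full[i][j] for j in range(3)] for i in range(3)]

w_ps = condense(w_full, 1)                          # W(p | s)
s_ps = condense(s_full, 1)                          # S(p | s)
lam = det(w_ps) / det(s_ps)
print(w_ps, s_ps, lam)
```

The condensed matrix agrees with the triple product form, and its determinant equals the determinant of the complete matrix divided by that of the block swept out, which gives a convenient arithmetic check.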


7d.5 Barnard’s Problem of Secular Variations in Skull Characters

The problem of measuring secular variations in skull characters con- 
sidered by Barnard (1935) is of immense importance to the anthro- 
pologists. It is, however, of interest to examine the methods employed 
by her in the light of the latest developments in multivariate analysis. 
The two problems involved in her study are: 




(i) The selection of a smaller number, out of seven skull characters, 
which give significant information, so far as is possible, as to changes 
taking place with time in four series of Egyptian skulls; and 

(ii) The determination of an expression, linear in measurements, which 
characterizes most effectively an individual skull with respect to the 
progressive secular changes. 

To answer problem (i), Barnard first chose basialveolar length and 
nasal height as two basic characters which, independently of each other, 
show significant variation in the four series. To choose further char- 
acters she considered the problem of testing the significance of the 
linear regression of the mean values of an added character with time 
(corresponding to the four series) when that part of the regression due 
to the two basic characters is removed. This meant the choice of 
characters with special reference to the average linear rate of change 
of the individual means with time. If the choice of characters is to be 
with reference to the complete nature of changes taking place with 
time, then what is needed is an internal analysis of the characters to 
decide whether the configuration of the four series as determined by 
several characters is the same as that indicated by a smaller number. 
Barnard’s method should, of course, be preferred if the regressions were 
known to be linear. This can, however, be tested from the data. 


Taking the four measurements

    x1 = basialveolar length
    x2 = nasal height
    x3 = maximum breadth
    x4 = basibregmatic height

the data are summarized in Tables 7d.5α and 7d.5β, which give the means
for the four series and the analysis of dispersion.


Table 7d.5α. Means for the Four Series

                               Series
            I             II            III           IV
         N1 = 91      N2 = 162      N3 = 70       N4 = 75
x1     133.582418    134.265432    134.371429    135.306667
x2      98.307692     96.462963     95.857143     95.040000
x3      50.835165     51.148148     50.100000     52.093333
x4     133.000000    134.882716    133.642857    131.466667




Table 7d.5β. Analysis of Dispersion (S.P. Matrix)

                        Dispersion due to
             Between          Within           Total
             3 D.F.          394 D.F.         397 D.F.
x1^2       123.180628      9661.997470      9785.178098
x2^2       486.345863      9073.115027      9559.460890
x3^2       150.411505      3938.320351      4088.731856
x4^2       640.733891      8741.508829      9382.242720
x1x2      -231.375635       445.573301       214.197666
x1x3        87.305348      1130.623900      1217.929248
x1x4      -128.763994      2148.584210      2019.820216
x2x3      -107.505618      1239.221990      1131.716372
x2x4       125.313318      2255.812722      2381.126040
x3x4      -137.580764      1271.054662      1133.473898


Example 1. Do the characters $x_3$ and $x_4$ show significant variation
in the four series independently of the variation due to the characters
$x_1$ and $x_2$?

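The method developed in 7d.4 is directly useful in this problem. The
within S.P. matrix for the basic characters $x_1$ and $x_2$ is

$$\begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} = \begin{pmatrix} 9661.997470 & 445.573301 \\ 445.573301 & 9073.115027 \end{pmatrix}$$

Its inverse is

$$\begin{pmatrix} W^{11} & W^{12} \\ W^{21} & W^{22} \end{pmatrix} = 10^{-4} \begin{pmatrix} 1.037332 & -0.050942 \\ -0.050942 & 1.104659 \end{pmatrix}$$

The within S.P. matrix for $x_3$, $x_4$ due to $x_1$, $x_2$ is given by the triple
product

$$\begin{pmatrix} W_{31} & W_{32} \\ W_{41} & W_{42} \end{pmatrix} \begin{pmatrix} W^{11} & W^{12} \\ W^{21} & W^{22} \end{pmatrix} \begin{pmatrix} W_{13} & W_{14} \\ W_{23} & W_{24} \end{pmatrix}$$

$$= \begin{pmatrix} 1130.623900 & 1239.221990 \\ 2148.584210 & 2255.812722 \end{pmatrix} \begin{pmatrix} W^{11} & W^{12} \\ W^{21} & W^{22} \end{pmatrix} \begin{pmatrix} W_{13} & W_{14} \\ W_{23} & W_{24} \end{pmatrix}$$

$$= 10^{-4} \begin{pmatrix} \text{[illegible]} & \text{[illegible]} \\ 2113.879535 & 2382.450625 \end{pmatrix} \begin{pmatrix} W_{13} & W_{14} \\ W_{23} & W_{24} \end{pmatrix}$$

$$= \begin{pmatrix} 287.966620 & 534.238796 \\ 534.238796 & 991.621041 \end{pmatrix}$$

The within S.P. matrix for $x_3$ and $x_4$ after correcting for $x_1$ and $x_2$ is

$$W(2 \mid 2) = \begin{pmatrix} W_{33} & W_{34} \\ W_{43} & W_{44} \end{pmatrix} - \begin{pmatrix} W_{31} & W_{32} \\ W_{41} & W_{42} \end{pmatrix} \begin{pmatrix} W^{11} & W^{12} \\ W^{21} & W^{22} \end{pmatrix} \begin{pmatrix} W_{13} & W_{14} \\ W_{23} & W_{24} \end{pmatrix}$$

$$= \begin{pmatrix} 3938.320351 & 1271.054662 \\ 1271.054662 & 8741.508829 \end{pmatrix} - \begin{pmatrix} 287.966620 & 534.238796 \\ 534.238796 & 991.621041 \end{pmatrix}$$

$$= \begin{pmatrix} 3650.353731 & 736.815866 \\ 736.815866 & 7749.887788 \end{pmatrix}$$

This has 394 - 2 = 392 degrees of freedom. Similarly, $S(2 \mid 2)$ with
397 - 2 = 395 degrees of freedom is

$$S(2 \mid 2) = \begin{pmatrix} 3809.335190 & 611.798381 \\ 611.798381 & 8393.755848 \end{pmatrix}$$

$$\Lambda = \frac{|W(2 \mid 2)|}{|S(2 \mid 2)|} = \frac{0.27746934 \times 10^{8}}{0.31600332 \times 10^{8}} = 0.878058$$

$$m = n - \frac{p + q + 1}{2} = 395 - \frac{2 + 3 + 1}{2} = 392$$

$$V = -m \log_e \Lambda = -392 \log_e (0.878058) = 51.39$$

This value of V with pq equal to 6 degrees of freedom is significant, so
that $x_3$ and $x_4$ may be considered as discriminating the series
independently of $x_1$ and $x_2$.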
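
The arithmetic of Example 1 can be retraced with a short script. What follows is only a sketch under stated assumptions, not the computing scheme used in the text: it is written in plain Python, it copies the within and total sums of squares and products from the analysis of dispersion table for these data, and the helper names `corrected` and `det2` are hypothetical. Evaluated directly in this way, $-392 \log_e \Lambda$ comes to about 50.98, slightly below the printed figure; the conclusion is unaffected.

```python
import math

# Within-series (394 d.f.) and total (397 d.f.) sums of squares and products,
# copied from the analysis of dispersion table; keys are character pairs.
W = {(1, 1): 9661.997470, (2, 2): 9073.115027, (3, 3): 3938.320351,
     (4, 4): 8741.508829, (1, 2): 445.573301, (1, 3): 1130.623900,
     (1, 4): 2148.584210, (2, 3): 1239.221990, (2, 4): 2255.812722,
     (3, 4): 1271.054662}
S = {(1, 1): 9785.178098, (2, 2): 9559.460890, (3, 3): 4088.731856,
     (4, 4): 9382.242720, (1, 2): 214.197666, (1, 3): 1217.929248,
     (1, 4): 2019.820216, (2, 3): 1131.716372, (2, 4): 2381.126040,
     (3, 4): 1133.473898}


def corrected(sp):
    """S.P. matrix of x3, x4 after removing the part due to x1, x2."""
    def g(i, j):
        return sp[(min(i, j), max(i, j))]

    # Inverse of the 2 x 2 block for the basic characters x1, x2.
    det = g(1, 1) * g(2, 2) - g(1, 2) ** 2
    inv = {(1, 1): g(2, 2) / det, (2, 2): g(1, 1) / det,
           (1, 2): -g(1, 2) / det, (2, 1): -g(1, 2) / det}
    out = []
    for i in (3, 4):
        row = []
        for j in (3, 4):
            # Triple product (W_ik)(W^kl)(W_lj), summed over k, l = 1, 2.
            due = sum(g(i, k) * inv[(k, l)] * g(l, j)
                      for k in (1, 2) for l in (1, 2))
            row.append(g(i, j) - due)
        out.append(row)
    return out


def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


w22 = corrected(W)                 # W(2 | 2), 392 d.f.
s22 = corrected(S)                 # S(2 | 2), 395 d.f.
lam = det2(w22) / det2(s22)        # the criterion Lambda
m = 395 - (2 + 3 + 1) / 2          # n - (p + q + 1)/2
V = -m * math.log(lam)
print(w22, s22, round(lam, 6), round(V, 2))
```

The same function serves for both W and S because the correction for $x_1$, $x_2$ is the same operation on either matrix.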

The above method could be simplified by starting with the full
matrices W and S and reducing them by the method of pivotal
condensation. The four pivotal elements for W are

$$10^{4}\,(0.966200,\ 0.905257,\ 0.365033,\ 0.760117)$$

and for S

$$10^{4}\,(0.978518,\ 0.955477,\ 0.380933,\ 0.829550)$$

The value of $|W(2 \mid 2)|$ is the product of the last two pivotal elements

$$10^{8}\,(0.365033)(0.760117) = 0.277469 \times 10^{8}$$

Similarly, $|S(2 \mid 2)| = 0.316003 \times 10^{8}$. Thus we obtain the same value
of $\Lambda$ as above.
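
Pivotal condensation is easily mechanized. The sketch below assumes that condensation means Gaussian elimination down the diagonal without row interchanges, which is safe here because S.P. matrices of this kind are positive definite; the function name `pivots` is hypothetical and the matrices are the within and total S.P. matrices of the four characters. The pivots it produces agree closely with the printed ones, and the product of the last two pivots of each matrix reproduces the value of $\Lambda$ found in Example 1.

```python
def pivots(matrix):
    """Successive pivots of Gaussian elimination without row interchange."""
    a = [row[:] for row in matrix]          # work on a copy
    n = len(a)
    found = []
    for k in range(n):
        p = a[k][k]
        found.append(p)
        for i in range(k + 1, n):
            f = a[i][k] / p
            for j in range(k, n):
                a[i][j] -= f * a[k][j]      # sweep out column k
    return found


W = [[9661.997470, 445.573301, 1130.623900, 2148.584210],
     [445.573301, 9073.115027, 1239.221990, 2255.812722],
     [1130.623900, 1239.221990, 3938.320351, 1271.054662],
     [2148.584210, 2255.812722, 1271.054662, 8741.508829]]
S = [[9785.178098, 214.197666, 1217.929248, 2019.820216],
     [214.197666, 9559.460890, 1131.716372, 2381.126040],
     [1217.929248, 1131.716372, 4088.731856, 1133.473898],
     [2019.820216, 2381.126040, 1133.473898, 9382.242720]]

pw, ps = pivots(W), pivots(S)
# After two condensations the remaining pivots belong to W(2 | 2) and S(2 | 2).
lam = (pw[2] * pw[3]) / (ps[2] * ps[3])
print(pw, ps, round(lam, 6))
```

The product of all four pivots is the determinant of the full matrix, so the same routine also supplies $|W|$ when all four characters are tested together.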
Example 2. Taking the relative times between the series in the pro- 
portion 2:1:2, can the variation of the characters be accounted for by 
the linear regression of individual characters with time? 




In order to obtain the regression with time, the values of t, the time
variable, may be taken as -5, -1, 1, and 5 for the individuals of the
first, second, third, and fourth series, respectively. The calculation of
individual regressions involves the quantities

$$\Sigma (t - \bar{t})^2 = 4307.66832$$

$$\Sigma x_1 (t - \bar{t}) = 718.76286 \qquad \Sigma x_3 (t - \bar{t}) = 410.10194$$

$$\Sigma x_2 (t - \bar{t}) = -1407.26075 \qquad \Sigma x_4 (t - \bar{t}) = -733.42758$$


The matrix R with 1 degree of freedom giving the squares and products
due to regression is given in Table 7d.5γ.


Table 7d.5γ. Matrix R with 1 Degree of Freedom

            x1             x2             x3             x4
x1      119.930358    -234.810812      68.428235    -122.377258
x2     -234.810812     459.734449    -133.975163     149.601596
x3       68.428235    -133.975163      39.042852     -69.824358
x4     -122.377258     149.601596     -69.824358     124.874099

In the above table

$$R_{11} = \frac{\{\Sigma x_1 (t - \bar{t})\}^2}{\Sigma (t - \bar{t})^2}, \qquad R_{12} = \frac{\{\Sigma x_1 (t - \bar{t})\}\{\Sigma x_2 (t - \bar{t})\}}{\Sigma (t - \bar{t})^2}, \qquad \text{and so on.}$$
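
The rule for forming R amounts to an outer product of the four sums of products with time, divided by the sum of squares of t. The sketch below is an illustration only: the function name `regression_sp` is hypothetical, and the sum for $x_3$ is taken as positive, which is an assumption made because the signs of the tabulated entries involving $x_3$ require it. Computed this way, every entry agrees with the tabulated matrix to about three decimals except the $(x_2, x_4)$ entry, which comes out near 239.60 rather than the tabulated 149.601596.

```python
# Sum of squares of t about its mean, and sums of products of x1..x4 with t.
stt = 4307.66832
sxt = [718.76286, -1407.26075, 410.10194, -733.42758]  # sign of third assumed +


def regression_sp(sums_with_t, sum_sq_t):
    """Squares and products due to linear regression on t (1 d.f.)."""
    return [[a * b / sum_sq_t for b in sums_with_t] for a in sums_with_t]


R = regression_sp(sxt, stt)
for row in R:
    print(["%.6f" % value for value in row])
```

Subtracting R from the total S.P. matrix, entry by entry, then gives the matrix for deviation from regression plus within series.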


With these results we analyze the dispersion, of which a typical product
($x_1 x_2$) is chosen below for illustration. To test the hypothesis that the
regressions are linear we compare W and Q + W.


Table 7d.5δ. Analysis of Dispersion

Due to                                        D.F.    S.P. Matrix (x1x2)
Regression                                       1    -234.810812  (R_ij)
Deviation from regression *                      2       3.435177  (Q_ij)
Total (between series)                           3    -231.375635  (R_ij + Q_ij)
Within series                                  394     445.573301  (W_ij)
Total                                          397     214.197666  (S_ij)
Deviation from regression + Within series      396     449.008478  (Q_ij + W_ij)

* Obtained by subtraction.

The complete matrix (Q_ij + W_ij) obtained by the above method is given in
Table 7d.5ε. This is the total S.P. matrix minus the matrix R due to
regression.




Table 7d.5ε. Matrix (Q + W) with 396 Degrees of Freedom

       x1              x2              x3              x4
  9665.247740      449.008478     1149.501013     2142.197474
   449.008478     9099.726441     1265.691535     2231.524444
  1149.501013     1265.691535     4049.689004     1203.298256
  2142.197474     2231.524444     1203.298256     9257.368621

$$\Lambda = \frac{|W|}{|Q + W|} = \frac{0.24269054 \times 10^{16}}{0.26873816 \times 10^{16}} = 0.90307436$$

$$V = -\left\{396 - \frac{2 + 4 + 1}{2}\right\} \log_e (0.90307436) = 40.02$$

The $\chi^2$ approximation has p × q = 2 × 4 degrees of freedom, since
Q has 2 degrees of freedom and there are four variables. The result is
significant so that the regressions cannot be considered linear.
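
The passage from $\Lambda$ to V is the same in both examples and can be written once. This is a minimal sketch, assuming only the relation $V = -\{n - (p + q + 1)/2\} \log_e \Lambda$ used above; the function name `chi_square_statistic` is hypothetical, and the two determinants are entered as printed with their common power of 10 dropped, since it cancels in the ratio.

```python
import math


def chi_square_statistic(lam, n, p, q):
    """Return V = -{n - (p + q + 1)/2} log_e(lam) and its p*q degrees of freedom."""
    m = n - (p + q + 1) / 2
    return -m * math.log(lam), p * q


# Test of linearity: four characters (p = 4), Q with 2 d.f. (q = 2), n = 396.
lam = 0.24269054 / 0.26873816
V, df = chi_square_statistic(lam, n=396, p=4, q=2)
print(round(lam, 8), round(V, 2), df)
```

V is then referred to the $\chi^2$ table with the returned degrees of freedom.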

This test can be extended to examine whether a parabolic regression
with time can explain the differences in mean values. The matrix Q
giving the deviation from regression has then 1 degree of freedom and
R due to regression 2.

To determine the coefficients of a linear compound which characterizes
most effectively the secular changes in progress, Barnard maximized
the ratio of the square of unweighted regression of the compound with
time. It is doubtful whether such a linear compound can be used to
specify an individual skull most effectively with respect to progressive
changes, since linear regression with time does not adequately explain
all the differences in the four series.


References 


AITKEN, A. C. (1933). On fitting polynomials to weighted data by least squares.
Proc. Roy. Soc. Edin., 54, 1.

BARNARD, M. M. (1935). The secular variation of skull characters in four series of
Egyptian skulls. Ann. Eugen. London, 7, 89.

BARTLETT, M. S. (1934). The vector representation of a sample. Proc. Camb.
Phil. Soc., 30, 327.

BARTLETT, M. S. (1938). Further aspects of the theory of multiple regression. Proc.
Camb. Phil. Soc., 34, 33.

BARTLETT, M. S. (1947). Multivariate analysis. J.R.S.S. Suppl., 9, 76.

COCHRAN, W. G., and C. I. BLISS (1948). Discriminant functions with covariance.
Ann. Math. Stats., 19, 151.

FISHER, R. A. (1936). The use of multiple measurements in taxonomic problems.
Ann. Eugen. London, 7, 179.

FISHER, R. A. (1938). The statistical utilization of multiple measurements. Ann.
Eugen. London, 8, 376.

FISHER, R. A. (1939). The sampling distribution of some statistics obtained from
nonlinear regression. Ann. Eugen. London, 9, 238.

FRETS, G. P. (1921). Heredity of head form in man. Reprinted from Genetica,
The Hague, Nijhoff, 3.

HOTELLING, H. (1931). The generalization of Student's ratio. Ann. Math. Stats.,
2, 360.

HOTELLING, H. (1936). The relation between two sets of variates. Biom., 28, 321.

HSU, P. (1939). On the distribution of the roots of certain determinantal equations.
Ann. Eugen. London, 9, 250.

MAHALANOBIS, P. C. (1936). On the generalized distance in statistics. Proc. Nat.
Inst. Sci. (India), 12, 49.

NAIR, U. S. (1939). The application of moment functions in the study of distribution
laws in statistics. Biom., 30, 274.

NEYMAN, J., and E. S. PEARSON (1928). On the use and interpretation of certain
test criteria for purpose of statistical inference. Biom., 20A, 175.

NEYMAN, J., and E. S. PEARSON (1931). On the problem of k samples. Bull. Int.
Acad. Cracovie, A, p. 460.

PEARSON, E. S., and J. NEYMAN (1930). On the problem of two samples. Bull.
Int. Acad. Cracovie, A, p. 73.

RAO, C. R. (1946). Tests with discriminant functions in multivariate analysis.
Sankhyā, 7, 407.

RAO, C. R. (1949). On some problems arising out of discrimination with multiple
characters. Sankhyā, 9, 343.

ROY, S. N. (1939). p-Statistics or some generalizations in analysis of variance
appropriate to multivariate problems. Sankhyā, 4, 381.

WALD, A., and R. J. BROOKNER (1941). On the distribution of Wilks' statistic for
testing independence of several groups of variables. Ann. Math. Stats., 12, 137.

WILKS, S. S. (1932). Certain generalizations in the analysis of variance. Biom.,
24, 471.

WILKS, S. S. (1935). On the independence of k sets of normally distributed
statistical variables. Econom., 3, 309.

WISHART, J. (1928). The generalized product moment distribution in samples from
a normal multivariate population. Biom., 20A, 32.


CHAPTER 8


Statistical Inference Applied to 
Classificatory Problems 


8a Tests of Null Hypotheses 


8a.1 Problems in Biological Research 

There are two types of problems confronted in biological research.
The first is that of specifying an individual as a member of one of many
groups to which he can possibly belong, as when a taxonomist has to
assign an organism to its proper species or subspecies or an anthropologist
is faced with the problem of sexing a skull or a jawbone. The second
is the problem of classification of the groups themselves into some
significant system based on the configuration of the various characteristics.
The need of this is felt in the study of systematics and the evolution
of species. A number of species or subspecies may have to be
arranged in a hierarchical order showing the closeness of some and the
distinctness of others. Such a representation superimposed on a geographical
classification may, it is suggested, be of use in tracing the
evolution of various species or subspecies.

The solution of these problems requires the development of a suitable
theory of statistical inference and the formulation of some practical
rules of procedure which the biologist can profitably use.

To start, it is useful to distinguish problems of discrimination from 
those of testing of hypotheses. Recently there has been a tendency to 
treat both these problems on an equal footing, and this has no doubt 
caused a good deal of confusion. In testing of hypotheses we have a 
clearly stated null hypothesis and a comparatively undefined set of 
alternatives. The emphasis is more on the null hypothesis, which may 
be rejected or provisionally accepted. When a null hypothesis is re- 
jected no decision is made about the actual alternative hypothesis. But 
in problems of discrimination we have a class of alternative hypotheses 
out of which one has to be chosen. Although it is a question of rejecting
the null hypothesis at a given risk in the former problem, it is a question
of balancing between wrong and correct decisions in the latter problem.
Although the a priori probabilities have no place, even conceptually, in 
problems of testing a null hypothesis, they are essential for a satisfac- 
tory solution of the problems of discrimination. In all scientific investi- 
gations both problems are important. 


8a.2 Null Hypotheses 


Consider the following problem, which frequently crops up in biologi- 
cal research. 

A specimen is observed, and it is desired to know, on the basis of some 
morphological measurements, whether it belongs to a previously classi- 
fied group whose characteristics are either known or estimated from a 
sample of individuals from that group. In such a problem there are 
only two possibilities: the new find belongs either to a known group or 
to an unknown group. The alternatives to the specified one are obvi- 
ously undefined. The new group might be one whose existence has yet 
to be established. 

Thus, when a fossil is discovered the paleontologist inquires whether
it is a specimen from a known collection. Such an inquiry is often made
with the hope of obtaining a negative answer, in which case the fossil
could be taken as a new specimen.

The investigator may not be successful in distinguishing a new specimen
from a previous collection because the answer depends on the evidence
supplied by the observed specimen. The only safeguard offered
by a statistical test in such a case is that it checks the investigator from
rushing to a hasty conclusion unless the evidence is strong enough. If
specimens like the observed or differing to a greater extent than the
observed from the characteristics [...] proportion in the group, [...]
hypothesis cannot be considered [...]. Only when this proportion is [...]
that the observed specimen be[...]. The choice of level of significance
[...] * but once it is fixed the rule of procedure [...] it may be possible
to refute any statement made about the observed specimen. Such an
inference is possible only when some risk is allowed.

* It is not arbitrary in the sense that we are assuming one value when in fact it
should be something else. It is one which is chosen by the investigator. Thus if
the consequences of rejecting a true hypothesis involve a great loss it is reasonable
to keep the level of significance as low as possible.


On the other hand, it is almost impossible to assert that the new find 
belongs to a specified group. To make such a statement we must ascer- 
tain whether the chance of the observed specimen’s arising from any 
other group is small. This is clearly not possible when the alternative 
groups are undefined ones. 

There is clearly no scope for the introduction of a priori probabilities 
in this case. However perfect our past knowledge may be about the 
species that have been already studied and their relative numbers, 
nothing can be said about the new species to be discovered. When 
these new species are considered as alternatives to a null hypothesis 
tested, there is no method of attaching a priori probabilities to the 
alternatives. 

Sometimes the a priori probabilities are introduced not as objective
quantities measured by observed frequencies but as measuring merely
psychological tendencies. If this is so we need further rules of procedure
for choosing the a priori probabilities themselves. One can recall the
efforts made by Jeffreys (1948) in this connection. To remove some
apparent contradictions in Bayes's postulate of equal ignorance, Jeffreys
advocates the use of certain invariant functions of the parameters
occurring in a probability distribution as a priori weights. Even here
no argument is put forward for using particular invariant functions of
the parameters. In fact, different choices lead to different results so
that no objective theory could be built up on the lines of inverse
probability.

To take another example, a geneticist inquires on the basis of observed
data whether two factors are segregating independently. If he can disprove
this with some confidence, then he acquires some basis on which
to plan future experiments, to estimate the intensity of their linkage,
and to study the relationship of the two factors under consideration
with others. If data are not sufficiently numerous, loose linkages go
undetected and it is only by repeated experimentation and accumulation
of evidence supplied by other factors linked with the former that
some definite conclusion can be arrived at.

The alternative to the hypothesis of independence in the above problem
is linkage with all possible values of the recombination fraction
(lying between 0 and 1). To the experimenter it is definite knowledge
if he can disprove the hypothesis of independence. Only then will he
proceed to inquire what the value of the recombination fraction is and
try to obtain an estimate. To ask for a priori probabilities of the
alternative recombination fractions before attempting to answer the
problem posed is to believe that from previous experience the frequencies
with which various recombination fractions occur can be deduced. But
there may not be sufficient reason to believe that the frequencies so
derived correspond to the total frequencies obtainable from all possible
factors known and unknown.

In the problem of the paleontologist the alternatives are completely 
undefined whereas in the problem of the geneticist the alternatives are 
known, viz., that the recombination fraction lies between 0 and 1. But 
in both types of problems there is no scope for the introduction of a 
priori probabilities. 

The null hypothesis is one which is chosen by the experimenter appro- 
priate to his inquiry. When sufficient evidence gathers against this 
during the experimental work, he rejects it. He is not trying to balance 
between the evidences supplied by the data on the various alternatives. 

Whether a particular null hypothesis is rejected or not, there is a
class of null hypotheses which are not contradicted by the data at a
given level of significance. Any hypothesis outside this class is rejected.
The class of null hypotheses acceptable to the data supplies us with
what may be called a fiducial set. When the hypotheses refer to the
values of a parameter, the fiducial set will be in the nature of an interval
called the fiducial interval (Fisher, 1947). The fiducial set of hypotheses
may be asserted to contain the true hypothesis because the chance of
its being left out is small (equal to the percentage level of significance
chosen). Thus, although it is not possible to accept any single hypothesis,
it is possible to restrict the scope of inquiry to only a subset of all
possible alternatives. Any further discrimination among the alternatives
in the fiducial set has necessarily to be based on insufficient evidence.
No statement of confidence can be made about a single hypothesis
chosen by any rule of procedure as the most appropriate for
the data, and consequently such a procedure does not possess a scientific
basis of inference.


If the problem needs the choice of a single hypothesis, then what
should be the nature of the answer? We might try to formulate a rule
of procedure which selects a hypothesis which is as near as possible to
the true hypothesis and which in large samples differs very little from
the true one with probability approaching certainty. The procedure of
choosing that hypothesis which maximizes the likelihood, advocated by
Fisher, conforms to the above requirement to a large extent. Thus the
two methodological problems, testing of hypothesis and estimation,
admit neat solutions independent of the probabilities a priori.

As an example for the determination of the fiducial interval, consider
a sample $x_1, \ldots, x_n$ from a normal population. If $\mu$ is the true mean
value, then

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

is distributed as t with (n - 1) degrees of freedom. All values of $\mu$ not
acceptable to the data at the 5% level of significance satisfy the
inequality

$$\frac{|\bar{x} - \mu|}{s/\sqrt{n}} \ge t_{5\%}$$

where $t_{5\%}$ is the 5% significant value of t. This gives two values

$$\bar{x} - \frac{s\,t_{5\%}}{\sqrt{n}} \quad \text{and} \quad \bar{x} + \frac{s\,t_{5\%}}{\sqrt{n}}$$

beyond which all values of $\mu$ are incompatible with the observed data.
In such a case we could assert subject to a small risk that the true value
of $\mu$ lies in the above interval.

Similarly, if the fiducial interval for $\sigma^2$ is needed, then two equations
are considered:

$$\frac{(n - 1)s^2}{\sigma^2} = \chi_1^2, \qquad \frac{(n - 1)s^2}{\sigma^2} = \chi_2^2$$

giving

$$\sigma^2 = \frac{(n - 1)s^2}{\chi_1^2} \quad \text{and} \quad \sigma^2 = \frac{(n - 1)s^2}{\chi_2^2}$$

where $(\chi_1^2, \chi_2^2)$ is the critical interval given in Table 6a.1α for $\chi^2$ with
(n - 1) degrees of freedom.
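
The two intervals translate directly into a few lines of code. The sketch below is an illustration under stated assumptions: the sample is invented, the function names are hypothetical, and the significant values are supplied by the user from tables rather than computed. The values used, 2.262 for t and (2.700, 19.023) for $\chi^2$ with 9 degrees of freedom, are the usual equal-tail 5% points, which may not be the critical interval tabulated in the text.

```python
import math
import statistics


def fiducial_interval_mean(xs, t_crit):
    """Limits x_bar -/+ t_crit * s / sqrt(n); s uses the divisor n - 1."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    s = statistics.stdev(xs)
    half_width = t_crit * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


def fiducial_interval_variance(xs, chi2_lower, chi2_upper):
    """Limits (n - 1)s^2 / chi2_upper and (n - 1)s^2 / chi2_lower."""
    n = len(xs)
    sum_sq = (n - 1) * statistics.variance(xs)
    return sum_sq / chi2_upper, sum_sq / chi2_lower


# Invented sample of n = 10 observations (9 degrees of freedom).
sample = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.4, 4.6]
lo, hi = fiducial_interval_mean(sample, t_crit=2.262)
v_lo, v_hi = fiducial_interval_variance(sample, chi2_lower=2.700, chi2_upper=19.023)
print((lo, hi), (v_lo, v_hi))
```

Note that the larger $\chi^2$ value gives the lower limit for $\sigma^2$, since $\sigma^2$ appears in the denominator of the pivotal quantity.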


8a.3 Power Function of Neyman and Pearson 

Various attempts have been made to build up a consistent theory
from which all tests of significance can be deduced as solutions to
precisely stated mathematical problems. It is difficult to argue whether
such a theory exists or not, but formal theories leading to a clear
understanding of the problems are nonetheless important. One such theory,
contributed by Neyman and Pearson (1933), is an important development
because it unfolded the various complex problems in testing of
hypotheses and led to the construction of general theories in problems
of discrimination, sequential tests, etc.

Any rule of procedure by which we can reject or accept a given
hypothesis $H_0$ consists in a division of all possible samples into two
groups, one opposed to $H_0$ and the other not unfavorable to it.
Whenever a sample of the first category occurs, we reject the hypothesis
$H_0$. As observed earlier, the frequency of the samples in this category,
when $H_0$ is true, ought to be small so that the chance of rejecting the
hypothesis when it is true is small. Let this chance be fixed as $\alpha$ (a
small assigned quantity). Corresponding to any procedure such as the
above, there is a frequency with which the samples of the first category
appear under a different hypothesis H. This frequency, denoted by
$\beta(H)$, is called the power of the test procedure associated with an
alternative hypothesis H.

This function $\beta(H)$ is fundamental in the theory of Neyman and
Pearson. It gives us the frequency with which various alternative
hypotheses could be detected, or, to be more exact, the frequency with
which the hypothesis $H_0$ is rejected when a different hypothesis H is
true. If the sample is to be given a fair chance of rejecting $H_0$, when it
is not true, the division of the samples must be such that the frequency
of those in the first category is as high as possible under any different
hypothesis. To start with, let us determine the maximum possible
frequency of detection associated with a given alternative hypothesis.

Let $f_H(x)$ denote the probability density of the observations x under
any hypothesis H. The sample observations $x_1, \ldots, x_n$ may be
represented by a point in a space of n dimensions, in which case the rule of
procedure suggested above results in a division of the space into two
regions, w for rejecting the hypothesis and the rest for not favoring any
alternative. Then, for a given H, what is w such that

$$\int_w f_{H_0}(x)\,dv = \alpha \quad \text{(an assigned quantity)} \tag{8a.3.1}$$

and

$$\int_w f_H(x)\,dv \ \text{is a maximum?}$$

Applying the result of lemma A1 in Appendix A, we find the best region
$w_0$ is defined by

$$f_H \ge \lambda f_{H_0} \quad \text{inside } w_0$$

and

$$f_H \le \lambda f_{H_0} \quad \text{outside } w_0$$

in which case the maximum $\beta(H)$ is

$$\beta(H) = \int_{w_0} f_H(x)\,dv$$

where $\lambda$ is determined to satisfy (8a.3.1).


We are in a happy situation if the same region $w_0$ makes $\beta(H)$ a
maximum for all H. In this situation $w_0$ is independent of H, and the
knowledge of any particular alternative H which may be true does not
help us in improving the test. The $w_0$ satisfying this property is said
to be a uniformly most powerful critical region, and when this exists the
test procedure is above criticism since nothing has been assumed about
the alternatives. The existence of such tests can be easily verified
because in this case the boundary of the critical region, $f_H/f_{H_0}$ = constant,
can be expressed without the use of any unknown quantities entering
in $f_H$.

Example 1. Consider n independent observations from a normal
distribution $N(\mu, \sigma^2)$. Let the null hypothesis be $\mu = \mu_0$.

$$f_H = c \exp\left\{-\frac{\Sigma (x_i - \mu)^2}{2\sigma^2}\right\}$$

$$f_{H_0} = c \exp\left\{-\frac{\Sigma (x_i - \mu_0)^2}{2\sigma^2}\right\}$$

$$\log \frac{f_H}{f_{H_0}} = \frac{\mu - \mu_0}{\sigma^2}(x_1 + \cdots + x_n) + \frac{n(\mu_0^2 - \mu^2)}{2\sigma^2}$$

The relationship

$$\log \frac{f_H}{f_{H_0}} \ge \log \lambda$$

reduces to

$$\bar{x}(\mu - \mu_0) \ge k$$

or

$$\bar{x} \ge k_1 \quad \text{if } \mu > \mu_0$$

$$\bar{x} \le k_2 \quad \text{if } \mu < \mu_0$$

A uniformly most powerful test exists only when it is known that the
alternative value is greater or smaller than the assigned value $\mu_0$. The
test simply depends on the distribution of the mean, $\bar{x}$, on the null
hypothesis. The distribution of $\bar{x}$ is

$$c \exp\left\{-\frac{n(\bar{x} - \mu_0)^2}{2\sigma^2}\right\} d\bar{x}$$

which involves another parameter $\sigma^2$ so that the test can be carried
out only when the hypothetical value of the standard deviation is
known. When it is not known, a suitable device is necessary to make
the test independent of $\sigma$.
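
When $\sigma$ is known and the alternatives are restricted to $\mu > \mu_0$, the critical value $k_1$ and the power follow from the normal distribution of $\bar{x}$. The sketch below is an illustration only, using Python's standard `statistics.NormalDist`; the function names and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist


def critical_value(mu0, sigma, n, alpha):
    """k1 such that P(x_bar >= k1) = alpha when the true mean is mu0."""
    z = NormalDist().inv_cdf(1 - alpha)
    return mu0 + z * sigma / math.sqrt(n)


def power(mu, mu0, sigma, n, alpha):
    """Chance of rejecting mu0 (that is, x_bar >= k1) when the true mean is mu."""
    k1 = critical_value(mu0, sigma, n, alpha)
    return 1 - NormalDist(mu, sigma / math.sqrt(n)).cdf(k1)


k1 = critical_value(mu0=0.0, sigma=1.0, n=25, alpha=0.05)
print(k1)
for mu in (0.0, 0.2, 0.5, 1.0):
    print(mu, power(mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05))
```

The critical value does not involve the alternative $\mu$, which is what makes the one-sided region uniformly most powerful; only the power changes with $\mu$.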

Example 2. The best region for testing the hypothesis $H_0$ against a
single alternative H is bounded by the surface of a constant value of a
function of the minimal set of sufficient statistics.

When a set of sufficient statistics $T_1, T_2, \ldots, T_p$ exists,

$$f_H = P(T \mid H)\,P(x \mid T)$$

and

$$f_{H_0} = P(T \mid H_0)\,P(x \mid T)$$

and hence the ratio $f_H/f_{H_0}$ is equivalent to $P(T \mid H)/P(T \mid H_0)$, which is
a function of $T_1, \ldots, T_p$ only.

Example 3. The probability of r successes in n trials of a binomial
population with proportion p is

$$\frac{n!}{r!\,(n - r)!}\, p^r q^{n - r}, \qquad q = 1 - p$$

If the null hypothesis is $p = p_0$, then $\log (f_H/f_{H_0}) \ge \log \lambda$ reduces to

$$r\left\{\log \frac{p}{p_0} - \log \frac{q}{q_0}\right\} \ge k$$

or

$$r \ge k_1 \quad \text{if } p > p_0$$

and

$$r \le k_2 \quad \text{if } p < p_0$$

so that the uniformly most powerful test exists only when it is known
that the alternative is greater or smaller than the assigned value.

In this example it is presumed that the best test is offered by the
ratio $f_H/f_{H_0}$ even when the stochastic variable is discontinuous. The
difficulty arises owing to the fact that there may not exist a $\lambda$ such that
the probability that $f_H \ge \lambda f_{H_0}$ is exactly equal to the assigned value,
the percentage level of significance. The setup appropriate for discrete
probability densities is to determine a class of events E such that

$$P_{H_0}(E) \le \alpha, \quad \text{an assigned value} \tag{8a.3.2}$$

$$P_H(E) \ \text{is a maximum}$$

The class E so determined constitutes the critical set, and an event in
this set disproves the null hypothesis. In the case of continuous
distributions the equality in (8a.3.2) is attained. Under the new setup the
events in the critical set are those for which the ratio $(f_H/f_{H_0}) \ge \lambda$,
where $\lambda$ is so chosen that the total probability of the events on the null
hypothesis does not exceed $\alpha$. The proof is similar to that in the case
of continuous distributions.
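
For the binomial case with alternatives $p > p_0$, the critical set consists of the largest values of r whose total probability on the null hypothesis does not exceed $\alpha$. The following sketch shows one way of finding it; the function names are hypothetical, and the figures n = 20, $p_0$ = 0.5, $\alpha$ = 0.05 are chosen only for illustration.

```python
from math import comb


def binom_pmf(r, n, p):
    """Probability of r successes in n trials with proportion p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


def upper_critical_value(n, p0, alpha):
    """Smallest k with P(r >= k | p0) <= alpha, with the size actually attained.

    Returns k = n + 1 and size 0.0 if even r = n alone is too probable.
    """
    tail = 0.0
    k = n + 1
    for r in range(n, -1, -1):
        term = binom_pmf(r, n, p0)
        if tail + term > alpha:
            break
        tail += term
        k = r
    return k, tail


def power(k, n, p):
    """Chance that r falls in the critical set {k, ..., n} when the proportion is p."""
    return sum(binom_pmf(r, n, p) for r in range(k, n + 1))


k, size = upper_critical_value(n=20, p0=0.5, alpha=0.05)
print(k, size, power(k, 20, 0.7))
```

In this illustration the attained size falls short of the assigned 5%, which is the inequality in (8a.3.2) at work: admitting r = 14 as well would carry the total probability above $\alpha$.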


8a.4 Locally Most Powerful Unbiased Tests 
Uniformly most powerful tests exist very rarely, so that in most cases
there may not be a single region which is the best for all alternative
hypotheses. As a first step in making the test independent of the
alternative hypotheses, Neyman and Pearson introduced the concept of the
locally most powerful unbiased test, applicable to cases where the
hypothesis is specified by the value of a parameter occurring in the
probability distribution. Assuming differentiation under the integral sign,
the solution depends on the existence of a region w such that

$$\int_w f(x \mid \theta_0)\,dv = \alpha \tag{8a.4.1}$$

$$\int_w f'(x \mid \theta_0)\,dv = 0 \tag{8a.4.2}$$

and

$$\int_w f''(x \mid \theta_0)\,dv \ \text{is a maximum} \tag{8a.4.3}$$

where $\theta_0$ is the value of the parameter under the null hypothesis and
primes denote differentiation with respect to $\theta$.
It follows from the lemma in A1 (Appendix A) that a region $w_0$
inside which

$$f''(\theta_0) \ge k_1 f(\theta_0) + k_2 f'(\theta_0)$$

and outside which

$$f''(\theta_0) \le k_1 f(\theta_0) + k_2 f'(\theta_0)$$

where $k_1$ and $k_2$ are determined to satisfy the conditions (8a.4.1) and
(8a.4.2), maximizes the integral in (8a.4.3).

This ensures maximum power only for alternatives in the immediate
neighborhood of the null hypothesis. This is not a good solution unless
the power is quite high for alternatives more distant from the null
hypothesis also. In fact, if a locally most powerful test has a very low
power beyond a certain range near the null hypothesis, no investigator
would be tempted to use it. There is no provision in the method of
derivation of a locally powerful test to safeguard against this. This
method therefore cannot be considered as general but can be regarded
only as a means by which test criteria can be derived for possible
comparison with any other offered test procedure.

Example 1. The locally most powerful test for the null hypothesis
$\sigma = \sigma_0$ is defined by

$$s^2 \ge s_2^2 \quad \text{and} \quad s^2 \le s_1^2$$

such that

$$(s_1^2)^{n/2}\, e^{-n s_1^2 / 2\sigma_0^2} = (s_2^2)^{n/2}\, e^{-n s_2^2 / 2\sigma_0^2}$$

and

$$\int_0^{s_1^2} P(s^2)\,ds^2 + \int_{s_2^2}^{\infty} P(s^2)\,ds^2 = \alpha$$

where

$$P(s^2)\,ds^2 = \text{const. } e^{-n s^2 / 2\sigma_0^2}\,(s^2)^{(n - 2)/2}\,ds^2$$

and $s^2$ is the estimate of $\sigma^2$ based on n degrees of freedom.

Example 2. Among locally unbiased tests the power for any alternative
value $\sigma$ is greatest in the above case.

Example 3. What is the locally unbiased most powerful test for a
given ratio of two hypothetical variances estimated from two independent
samples from normal populations?

[Hint: Start with the variance ratio F distribution on $n_1$, $n_2$ degrees
of freedom, assuming a hypothetical ratio $\rho$. The best test leads to the
condition that a quadratic expression in F is greater than zero, so that
the test is $F \ge F_2$ and $F \le F_1$. To determine the relation between $F_1$
and $F_2$, express the condition of unbiasedness. This leads to a condition
of the form $F_1^{n_1/2}/(1 + cF_1)^{(n_1 + n_2)/2} = F_2^{n_1/2}/(1 + cF_2)^{(n_1 + n_2)/2}$.]

Example 4. Prove that among locally unbiased tests the test derived
in example 3 is uniformly most powerful.

The test derived in examples 1 and 2 is used in Chapter 6 in testing 
whether a calculated variance is in agreement with an assigned value. 
The test derived in examples 3 and 4 is useful in testing whether two 
estimated variances are equal in their expectation. In Chapter 6, a 


different test based on the L statistic was used. But these two are 
equivalent. 


8a.5 Test for a Finite Number of Alternatives 


Consider a null hypothesis $H_0$ and a set of alternatives $H_1, H_2, \ldots$.
Let the power of the best possible test for $H_0$ when $H_i$ is the only
alternative be denoted by $\gamma_i(\alpha)$, where $\alpha$ denotes the level of significance.
Any region w suggested as the critical region for testing $H_0$ will have

$$\int_w f_i\,dv = \beta_i(\alpha)$$

as the power for the alternative $H_i$. In no case can $\beta_i$ exceed $\gamma_i$, but
there may exist a single region such that $\beta_i = \gamma_i$ for all i, in which case
a uniformly most powerful test exists.

If this is not so, various alternatives have been suggested. One is to
choose a region which maximizes the minimum $\beta_i$ (Neyman and Pearson,
1933). Such a procedure may give undue preference to the hypotheses
nearer * to the null hypothesis. It may be felt that a method
which effectively controls the errors of not accepting a nearer hypothesis
when it is true will be good enough for distant hypotheses.

* A hypothesis $H_i$ can be said to be nearer than $H_j$ to $H_0$ if $\gamma_i < \gamma_j$. With this
concept a suitable distance function between two hypotheses can be defined as
shown in Chapter 9.

On the other hand, we may take the view that in the course of ex- 
perimentation it is necessary to detect a distant hypothesis as early as 
we can. If, in fact, a distant hypothesis were true and the critical region
had been so chosen as to give this hypothesis the maximum possible
power, then it could be discovered with the minimum possible
number of observations. If a nearer hypothesis were true, a larger ex- 
periment would be necessary to detect it. In such a case the experi- 
menter might consider himself unlucky on the choice of his subject or 
might regard the consequences of accepting Ho when, in fact, an alter- 
native close to it is true as less serious than when the alternative is 
distant. 

A compromise solution may be suggested if the experimenter can 
assign a priori probabilities for the various alternatives. This means 
that he has a knowledge of a series of similar experiments and the fre- 
quencies of various types of alternatives. When such a knowledge is 
imperfect or if the experimenter is not sure that the particular experi- 
ment he is conducting belongs to the same group of experiments that 
have been conducted before, no unique solution is possible. In the ab- 
sence of any information about the a priori probabilities, as a compro- 
mise between the two views of maximizing the minimum power or 
giving more weight to distant hypotheses, the following solution is 
suggested.

The critical region $w$ is chosen such that the common ratio

$$\frac{\beta_1(\alpha)}{\gamma_1(\alpha)} = \frac{\beta_2(\alpha)}{\gamma_2(\alpha)} = \cdots$$

is a maximum, where $\beta$ and $\gamma$ are as defined above. This method supplies
a system of weights to be attached to the powers due to various
alternatives, the weights being the individual maximum powers. This
region has the following two properties.

(i) The distant hypotheses have necessarily more power than the
nearer hypotheses.

(ii) The individual maximum powers are now reduced in the same
proportion with the provision that this proportion is as small as possible.

If $f_0, f_1, f_2, \ldots$ denote the probability densities for the hypotheses
$H_0, H_1, H_2, \ldots$, then the region satisfying the above requirements is
deducible from the lemma proved in Appendix A4. The inside of this
region $w$ is defined by

$$f_0 \le \lambda_1 f_1 + \lambda_2 f_2 + \cdots$$

where $\lambda_1, \lambda_2, \ldots$ are determined from the relations

$$\int_w f_0 \, dv = \alpha \tag{8a.5.1}$$

and

$$\frac{1}{\gamma_1} \int_w f_1 \, dv = \frac{1}{\gamma_2} \int_w f_2 \, dv = \cdots \tag{8a.5.2}$$


The solution deduced above is not useful in practice because of the
difficulty in evaluating the constants. It may be convenient to consider
the region complementary to

$$f_0 \ge \mu_i f_i, \quad i = 1, 2, \ldots \tag{8a.5.3}$$

as the critical region, the quantities $\mu_1, \mu_2, \ldots$ being determined to
satisfy the relations (8a.5.1) and (8a.5.2).


8a.6 Tests When the Alternatives Are Continuous 


The foregoing theory could be extended to the case where the alternatives
can be specified by parameters with continuous variation. The
following definitions will be useful.

A region $w$ which gives equal power to all hypotheses equidistant,
i.e., having the same power of detection, from the null hypothesis is
called a distance power region. A region $w_0$ is said to be the best
distance power region if

(i) the size of $w_0$ is $\alpha$ (an assigned value),

(ii) $w_0$ is a distance power region,

(iii) for any specified alternative hypothesis the power associated
with the region $w_0$ is not less than the power for any other region
satisfying requirements (i) and (ii).

Let $\Delta$ denote the distance of a hypothesis $H$ from $H_0$, the null
hypothesis. Then a distance power region satisfies the condition

$$\int_w f \, dv = \phi(\Delta), \quad \text{a function of } \Delta \text{ only}$$

If the parameters entering in the alternative hypothesis are denoted
symbolically by $\theta$ and in the null hypothesis by $\theta_0$, then

$$\int_w f(\theta) \, dv = \phi(\Delta) \quad \text{and} \quad \int_w f(\theta_0) \, dv = \alpha \tag{8a.6.1}$$


Let us define the inside of the region by

$$f(\theta_0) \le \int_{\Delta = \text{const.}} \lambda(\theta) f(\theta) \, ds \tag{8a.6.2}$$

where the integral is taken over the surface $\Delta$ = constant. Let there
exist a positive function $\lambda(\theta)$ such that the conditions (8a.6.1) are
satisfied. The region $w_0$, if it exists, is the best distance power region
for alternatives on the surface $\Delta$ = constant. This follows from the
lemma in Appendix A4 extended to an infinite set of alternatives. If
the relationship (8a.6.2) is independent of the alternative used, then we
obtain a uniformly best distance power test. It is seen that the region
(8a.6.2) is the same as the region which has the best average power for
alternatives on the surface $\Delta$ = constant and for an assigned a priori
probability density $\lambda(\theta)$ of the parameters. Although in the theory of
average power tests there is no justification for choosing a particular
type of the density function $\lambda(\theta)$ on which the test generally depends,
the function $\lambda(\theta)$ is suitably determined in constructing distance power
tests. The determination of such a function, even if its existence is
known, may be a difficult task. Once it is determined by trial or
otherwise, the optimum property of the test is immediately established.

It is of interest to examine the critical region obtained by extending
the results in (8a.5.3) to the case of alternative hypotheses specified by
parameters with continuous variation. The outside of such a critical
region is defined by

$$f(\theta_0) \ge \lambda(\theta, \Delta) f(\theta)$$

for all $\theta$ on the surface $\Delta(\theta) = \Delta$, where $\Delta$ is the specified distance of the
alternative from the null hypothesis. If, owing to considerations of
symmetry, the function $\lambda(\theta, \Delta)$ could be replaced by a function of $\Delta$
only, then the critical region is the outside of the envelope of the surfaces

$$f(\theta_0) = \lambda(\Delta) f(\theta)$$

for variations in $\theta$ on the surface $\Delta(\theta) = \Delta$. This is the likelihood ratio
test developed by Neyman and Pearson (1928).
Example 1. Let $x_1, \ldots, x_n$ be $n$ independent normal variates having
zero mean on the null hypothesis. For any alternative hypothesis
specifying the mean of $x_i$ as $\mu_i$ the best test is

$$\mu_1 x_1 + \cdots + \mu_n x_n \ge k$$

and the associated power is a function of $\mu^2 = \mu_1^2 + \cdots + \mu_n^2$. So we
can define the distance of the alternative hypothesis from the null by $\mu^2$.

The best distance power test, if it exists, is given by

$$e^{-\sum x_i^2 / 2\sigma^2} \le \int e^{-\sum (x_i - \mu_i)^2 / 2\sigma^2} f(\mu_1, \ldots, \mu_n) \, d\mu_1 \cdots d\mu_n$$

$f(\mu_1, \ldots, \mu_n)$ can be chosen to be a constant, in which case because of
symmetry between $x$ and $\mu$ the test reduces to

$$\sum x^2 > k$$

which is a distance power test, the distribution of $\sum x^2$ being that of a
noncentral $\chi^2$ involving only the parameter $\mu^2$.

Example 2. Following the method of example 1, construct a distance 
power test to examine whether p correlated variables have assigned 
mean values. This leads to Hotelling’s T with a known dispersion
matrix, in which case T has a $\chi^2$ distribution.


When the variances and covariances are not known in examples 1
and 2 above, a slightly different technique has to be followed. The
best region, besides satisfying the above property, must be similar to
the sample space with respect to the unknown variances and covariances.
That is, the integral of the probability density over such a
region is equal to a constant $\alpha$, whatever may be the value of the
variances and covariances. We shall not enter into the mathematics of the
construction of similar regions but simply note that all the univariate
and multivariate tests considered in the earlier chapters are all best
distance power tests.


8b Problems of Discrimination 
8b.1 The General Problem 


We now come to a group of problems where a priori probabilities are
needed for a satisfactory solution and the null hypothesis does not play
a prominent part but is sometimes posed to arrive at a decision subject
to a small risk.

Thus when a question is asked whether a skull or a jawbone belonged
to a male or a female, there are evidently two alternative hypotheses
and one has to be chosen. Here a procedure is needed by which the
individual specimen can be assigned to one or the other of the groups.
In any such procedure, errors are inevitable unless the ranges
of measurements for the two groups are completely different.

We first answer the following problem. Suppose an individual is
drawn from a population consisting of two distinct groups of
individuals in the ratio $\pi_1 : \pi_2$ $(\pi_1 + \pi_2 = 1)$. If $\alpha_1$ is the chance of
wrongly classifying an individual of the first group by following any rule
of procedure, and $\alpha_2$ the corresponding chance for the second group,
then the probability of wrongly classifying an individual chosen at random
is $(\pi_1 \alpha_1 + \pi_2 \alpha_2)$. Evidently that procedure is the best which
minimizes this error.

If the individual admits $p$ measurements, then we need a division of
the $p$ dimensional space into two regions, $R_1$ and $R_2$, such that when the
point representing the $p$ measurements falls in $R_1$ the individual is
assigned to the first group, and otherwise to the second. If $f_1(x \mid \theta_1)$ and
$f_2(x \mid \theta_2)$ represent the probability densities, then the chance of
committing an error is

$$\pi_1 \int_{R_2} f_1 \, dv + \pi_2 \int_{R_1} f_2 \, dv$$

We need such a division for which the above value is a minimum.
Following a lemma given in Appendix A2, we find that the best regions
are

$$R_1 \sim \pi_1 f_1 \ge \pi_2 f_2$$
$$R_2 \sim \pi_2 f_2 \ge \pi_1 f_1$$

where the symbol $\sim$ stands for “defined by.” This supplies a mutually
exclusive division of the space into two regions, $R_1$ and $R_2$. The case
where the equality occurs can be decided by considering the corresponding
relationship when one measurement (chosen at random from the
available $p$) is omitted.

If, in any problem, there is scope for the introduction of a risk function
specifying the loss incurred in a wrong classification, then the best
solution which minimizes the expected risk can be determined as follows.
Let $r_1$ be the loss resulting in assigning an individual of the first
group to the second, and $r_2$ of the second to the first. Then the expected
loss is

$$\pi_1 r_1 \alpha_1 + \pi_2 r_2 \alpha_2$$

The best solution is

$$R_1 \sim \pi_1 r_1 f_1 \ge \pi_2 r_2 f_2$$
$$R_2 \sim \pi_2 r_2 f_2 \ge \pi_1 r_1 f_1$$
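The rule just stated can be sketched in a few lines of Python. This is an illustration only: the two univariate normal densities, the equal proportions, and the loss of 5 are assumed values, and the names `normal_pdf` and `assign_group` are hypothetical helpers, not part of the text.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def assign_group(x, f1, f2, pi1, pi2, r1=1.0, r2=1.0):
    """Return 1 if pi1*r1*f1(x) >= pi2*r2*f2(x), otherwise 2.

    r1 is the loss from sending a first-group individual to the second
    group; r2 is the loss from the opposite mistake. With r1 = r2 the
    rule reduces to comparing pi1*f1 with pi2*f2.
    """
    return 1 if pi1 * r1 * f1(x) >= pi2 * r2 * f2(x) else 2

# Assumed example: groups N(0, 1) and N(2, 1) mixed in equal proportions,
# so that with equal losses the boundary lies at x = 1.
f1 = lambda x: normal_pdf(x, 0.0, 1.0)
f2 = lambda x: normal_pdf(x, 2.0, 1.0)

print(assign_group(0.9, f1, f2, 0.5, 0.5))          # just below the boundary
print(assign_group(1.1, f1, f2, 0.5, 0.5))          # just above the boundary
print(assign_group(1.1, f1, f2, 0.5, 0.5, r1=5.0))  # missing group 1 costs more
```

Raising the loss attached to one kind of error moves the boundary so that fewer individuals of that group are misassigned, at the price of more errors of the other kind.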


8b.2 The Discriminant Function of R. A. Fisher 

In the cases considered in the above section it is seen that the boundary
separating the two regions in the space is defined by a constant value
of the likelihood ratio. If the probability densities are multivariate normal
with the same dispersion matrix $(\alpha_{ij})$ and mean values $\mu_{11}, \ldots, \mu_{p1}$
and $\mu_{12}, \ldots, \mu_{p2}$ for the first and second groups, then the likelihood
ratio or its logarithm is

$$-\tfrac{1}{2} \sum\sum \alpha^{ij} [(x_i - \mu_{i1})(x_j - \mu_{j1}) - (x_i - \mu_{i2})(x_j - \mu_{j2})]$$

where the matrix $(\alpha^{ij})$ is reciprocal to $(\alpha_{ij})$. Simplifying the above
expression, the surface of a constant likelihood ratio can be expressed

$$\sum_{i=1}^{p} (\alpha^{1i} d_1 + \cdots + \alpha^{pi} d_p) x_i = \text{const.} \tag{8b.2.1}$$

where $d_j = \mu_{j1} - \mu_{j2}$ $(j = 1, 2, \ldots, p)$. The regions in the $p$ dimensional
space are thus separated by a hyperplane whose equation is
(8b.2.1) for a suitably determined constant. An individual for whom the
value of the left-hand function exceeds the constant value chosen is assigned
to the first group and when it is smaller he is assigned to the second
group.

The linear function of the measurements deduced above is called the 
discriminant function, first introduced by R. A. Fisher, who suggested 
the following computational procedure. 

If there is only one character, then the problem of classification is very 
simple; all individuals with the value of that character exceeding a suit- 
ably determined value could be assigned to one group, and the rest to the 
other. The multiple character case is reduced to that of a single variate 
by replacing the several measurements by a suitably chosen linear compound.
If $x_1, x_2, \ldots, x_p$ are the measurements, then an arbitrary linear
compound is $l_1 x_1 + \cdots + l_p x_p$. The coefficients $l_1, \ldots, l_p$ may be chosen
such that the linear compound affords the maximum discrimination between
the two groups.


The function $l_1 x_1 + \cdots + l_p x_p$ has the variance

$$\sum\sum \alpha_{ij} l_i l_j \tag{8b.2.2}$$

and the square of the difference in mean values of this compound for the
two groups is

$$(l_1 d_1 + \cdots + l_p d_p)^2 \tag{8b.2.3}$$

The coefficients may be chosen such that the difference in mean values
is a maximum, subject to the condition that the variance (8b.2.2) is a
constant (say, unity). This is also equivalent to maximizing the ratio
of (8b.2.3) to (8b.2.2) without any condition on the compounding
coefficients. Introducing a Lagrangian multiplier $\lambda$ and differentiating the
expression

$$\sum\sum l_i l_j d_i d_j - \lambda \sum\sum \alpha_{ij} l_i l_j$$


we obtain the equations

$$\begin{aligned}
l_1 \alpha_{11} + l_2 \alpha_{12} + \cdots + l_p \alpha_{1p} &= k d_1 \\
l_1 \alpha_{21} + l_2 \alpha_{22} + \cdots + l_p \alpha_{2p} &= k d_2 \\
&\;\;\vdots \\
l_1 \alpha_{p1} + l_2 \alpha_{p2} + \cdots + l_p \alpha_{pp} &= k d_p
\end{aligned}$$

where $k = (l_1 d_1 + \cdots + l_p d_p)/\lambda$. Observing that the above equations
can supply only ratios of $l_1, \ldots, l_p$, we may substitute $k = 1$ and solve
the above equations. The final values of $l_1, \ldots, l_p$ may be adjusted by
multiplying each of them by a constant $\theta$ where

$$\theta^2 \sum\sum l_i l_j \alpha_{ij} = 1$$

This is unnecessary because the constant separating the values of the
discriminant function for classification into the two groups can be adjusted
suitably. The linear equations obtained above have the solutions

$$l_i = \alpha^{1i} d_1 + \cdots + \alpha^{pi} d_p, \quad i = 1, 2, \ldots, p$$


thus giving the same linear function derived as the ratio of the two
likelihoods. Thus Fisher’s linear discriminant function is the best for
classification when the distributions are multivariate normal and the
dispersion matrices are the same. If the dispersion matrices are different,
then the likelihood ratio surface is defined by the quadratic
expression

$$\sum\sum \{\alpha^{ij}(x_i - \mu_{i1})(x_j - \mu_{j1}) - \beta^{ij}(x_i - \mu_{i2})(x_j - \mu_{j2})\} = \text{const.}$$

where $(\alpha^{ij})$ and $(\beta^{ij})$ are the inverses of the dispersion matrices
corresponding to the two populations.
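As a rough illustration of the computational procedure, the sketch below solves the linear equations for the coefficients with $k = 1$ for a two-character case. The dispersion matrix and the group means are invented for the example, and `solve_linear` and `discriminant` are hypothetical helper names.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Bring the largest remaining pivot into row k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

# Assumed common dispersion matrix and group means (p = 2).
dispersion = [[4.0, 2.0], [2.0, 5.0]]
mean1 = [10.0, 8.0]
mean2 = [7.0, 6.0]

d = [m1 - m2 for m1, m2 in zip(mean1, mean2)]   # differences d_1, d_2
coeffs = solve_linear(dispersion, d)            # coefficients l_i with k = 1

def discriminant(x):
    """Value of the linear compound l_1 x_1 + ... + l_p x_p."""
    return sum(l * xi for l, xi in zip(coeffs, x))

# With k = 1 the difference in mean scores equals the variance of the
# linear function.
d_squared = discriminant(mean1) - discriminant(mean2)
variance = sum(coeffs[i] * coeffs[j] * dispersion[i][j]
               for i in range(2) for j in range(2))
print(coeffs, d_squared, variance)
```

Scaling the coefficients to unit variance would change only the constant that separates the two groups, which is why the adjustment is described as unnecessary.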


8b.3 Some Difficulties in the Use of the Best Discriminating Solution 

The elegant solution obtained above has many limitations so far as
the practical applications are concerned.

(i) The parameters occurring in the probability distributions are not
usually known. The only solution is to obtain their best possible estimates
and substitute them for the unknown values in setting up the
discriminant function. This introduces some additional errors in
classification, depending on the paucity of the available material for
the estimation of the parameters.

(ii) The a priori probabilities explicitly occurring in the best solution
may not be known, and in some cases they may not be estimable from
the data.

(iii) The finite number of alternatives into which an
observed individual has to be classified is assigned by the investigator.
In some cases, such as sexing, there are only two alternatives possible.
But in general it may be necessary to test whether the a priori information
that an individual belongs to one of the given groups is correct or
not. Only when the a priori information is ascertained to be correct
can we proceed to decide to which of the given groups he is likely to
belong.

(iv) Even by following the best procedure of classification it may not 
be possible to assert with confidence that any individual has been cor- 
rectly classified. Can any provision be made to identify at least those 
cases which are less likely to be misclassified? 

(v) Suppose that it is known that an individual has been taken at 
random from only one group and it is not known whether it is the first 
or the second group. Should he be treated in the same way as an indi- 
vidual drawn from a mixed population? 

(vi) What is the nature of a risk function in biometric investigations? 

(vii) Suppose that some quick decisions are needed. Is there any
simple method of arriving at a discriminating function which is a good
approximation to the ideal one?

These problems are discussed in the following sections with suitable
illustrations.


8b.4 Uncertainty of the A Priori Information That One of the Alternatives Is Correct

For discriminatory analysis it must be known that an individual belongs
to one or the other of two groups. Such knowledge has to be inferred
from external evidence; the association of artifacts with a burial
sometimes provides it for skeletal remains. In questions of sexing
bones, the chances of identification are limited to two possibilities, male
and female. However, where the external evidence is slight or equivocal,
the assignment of an individual to one of two groups may be subject
to another kind of error, viz., the wrong assumption that he belongs
to one of the two groups when, in fact, he comes from a third unknown
group. In the absence of any definite knowledge about the nature of
the third group, we may have to examine by means of the internal evidence
supplied by the measurements on the individual whether he could
be considered as belonging to either of the two groups; that is, we examine
whether there is any evidence to suggest that the individual could
not have come from one or other of the two groups. Consider the following
problem.

In August, 1939, a relatively complete male human skeleton was 
recovered from the ditch of an Iron Age camp on Highdown Hill, Gor- 
ing by Sea, in the course of excavations conducted under the auspices 
of the Worthing Archaeological Society. Fragments associated with the 
bones suggest that the burial could not have taken place later than the 
very beginning of the Iron Age in Sussex, about 500 B.C. The camp
went out of use not later than 250 B.C., and the remains themselves can
be assigned to a 500 B.C. “invasion” horizon. It is doubtful, however,
whether their owner was a Bronze Age “defender” or an Iron Age “in- 
vader.” The principal question to be considered in the present context 
is whether the Highdown skull is more likely to have belonged to a 
Bronze Age or to an Iron Age population. 

An attempt is made here to answer this problem by utilizing the pub- 
lished data concerning the Bronze Age and the Iron Age represented by 
Romano-British crania from Maiden Castle. The characteristics of 
these groups have been computed from scanty material, so that the 
conclusion regarding the Highdown skull cannot be treated as final. 
This example has been chosen merely to illustrate the method. 

In solving the problem whether the Highdown skull belongs to the 
Bronze Age or to the Iron Age, we can test separately the two null hy- 
potheses: (1) it belongs to the Bronze Age, and (2) it belongs to the 
other group. If neither of the two hypotheses can be rejected on the 
5% level, there is sufficient justification to proceed with the problem of 
assigning the skull to one of the two groups. 

It must be noted that in such a procedure we are not testing the 
combined null hypothesis that the specimen belongs to one or the other 
of the groups at the 5% level. Of the 5% of the rejected cases under 
one hypothesis, some are accepted under the second hypothesis so that
when the 5% level is used for the two hypotheses separately we will be
judging the combined null hypothesis at a lower level.

An adjustment could be made in the test procedure to correct this by
defining the critical region (of rejection)

$$\sum\sum \alpha^{ij} (x_i - \mu_{i1})(x_j - \mu_{j1}) \ge c$$
$$\sum\sum \alpha^{ij} (x_i - \mu_{i2})(x_j - \mu_{j2}) \ge c$$

where $(\alpha^{ij})$ is the reciprocal of the dispersion matrix $(\alpha_{ij})$; $\mu_{i1}, \mu_{i2}$ are
mean values for the two groups; $x_1, x_2, \ldots$ are the measurements on
the individual; and $c$ is chosen such that the total density of the region
is $\alpha$ (the assigned value) for each of the probability distributions. A value
of $c$ satisfying the above condition exists, but
it is difficult to find its actual value. In practice the procedure outlined
earlier of testing the two hypotheses separately may be followed.

Table 8b.4α gives the mean values of male English Bronze Age
(Morant, 1926) and Maiden Castle (Goodwin and Morant, 1940)
cranial measurements.


TABLE 8b.4α. Measurements on the Highdown Skull: Mean Values of Bronze Age
and Maiden Castle Series

Character    Highdown Skull    Bronze Age      Maiden Castle
L            198.2             184.5 (45) *    188.6 (23)
B            148.1             149.9 (89)      140.8 (24)
H'           142.0             134.9 (25)      137.1 (22)
G'H           72.4              69.1 (30)       72.4 (14)
GB            95.8              98.0 (11)       95.2 (18)
NH,L          48.2              49.1 (13)       51.9 (16)

* The figures in parentheses indicate numbers on which the averages are based.

L = head length from the glabella in the median sagittal plane.
B = head breadth on the parietal bones perpendicular to L.
H' = head height from the basion to the bregma.
G'H = upper facial height from nasion to the alveolare.
GB = bimaxillary breadth between the zygomaxillaria.
NH,L = nasal height from the nasion to the left nariale.

In the present problem, since only a few of the Bronze Age standard
deviations have been published and none are available for the Maiden
Castle, the variance-covariance ... Farringdon Street (E... used in the analysis.
... observed elsewhere the ... ed functions of the original variables. This
can be easily obtained by using the dispersion matrix with a unit matrix
appended to it and reducing the dispersion matrix by the method of
sweep out as shown in Table 8b.4β. The theoretical discussion associated
with this procedure is treated in Appendix B.
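The reduction described here, a dispersion matrix with a unit matrix appended and then swept out row by row, can be sketched as follows. The 3 × 3 dispersion matrix is invented for illustration and is not the matrix used in the text; `pivotal_condensation` is a hypothetical name for the routine.

```python
import math

def pivotal_condensation(a):
    """Sweep out [A | I] row by row; return (pivots, coefficient rows).

    Row k of the coefficient rows gives a linear function Y_k of the
    original variables, and pivots[k] is the variance of Y_k.
    """
    p = len(a)
    m = [list(map(float, a[i])) + [1.0 if j == i else 0.0 for j in range(p)]
         for i in range(p)]
    pivots, coeffs = [], []
    for k in range(p):
        pivot = m[k][k]
        pivots.append(pivot)
        coeffs.append(m[k][p:])          # right-hand part: Y_k in terms of x
        for i in range(k + 1, p):
            factor = m[i][k] / pivot
            m[i] = [mi - factor * mk for mi, mk in zip(m[i], m[k])]
    return pivots, coeffs

# Assumed dispersion matrix for three characters (illustrative values).
a = [[4.0, 2.0, 1.0],
     [2.0, 5.0, 3.0],
     [1.0, 3.0, 6.0]]
pivots, coeffs = pivotal_condensation(a)

# Dividing each function by its standard deviation gives unit variance.
t = [[c / math.sqrt(v) for c in row] for row, v in zip(coeffs, pivots)]

# Dispersion matrix of the transformed variables, T A T'.
cov = [[sum(t[i][r] * a[r][s] * t[j][s] for r in range(3) for s in range(3))
        for j in range(3)] for i in range(3)]
for row in cov:
    print([round(v, 10) for v in row])
```

Each transformed variable involves only the characters up to its own position, so the matrix of coefficients is triangular, and the computed dispersion matrix of the transformed variables is the unit matrix up to rounding.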


TABLE 8b.4β. Pivotal Condensation Method for the Construction of Transformation to an Uncorrelated Set

[Table body: pivotal rows 10, 20, ..., 60 and the successive reduced matrices for the characters L, B, H', G'H, GB, NH,L, with the matrix for functions of the original variables alongside.]

1. The elements below the diagonal are omitted because the matrices at all stages are symmetrical.
2. The rows 10, 20, 30, 40, 50, 60 represent the pivotal rows at each stage of reduction. Following each pivotal row is given the reduced matrix. The first row in the reduced matrix supplies a linear function of the variables whose variance is the pivotal element (underlined) in that row. Thus, Y1, Y2, Y3, Y4, Y5, and Y6 have 41.73, 32.0255, 23.1440, 15.5452, 34.9515, and 4.3470 as the variances.
3. The computations can be compactly represented by omitting the matrix for functions of original variables and accommodating the figures below the diagonal in the left-hand matrix as shown in Table 7b.6β.


Obtaining the expressions for $Y_1, Y_2, \ldots, Y_6$ from Table 8b.4β and
dividing them by the corresponding standard deviations, we obtain the
following uncorrelated transformed variables with unit variances.

$$\begin{aligned}
y_1 &= 0.1548L \\
y_2 &= -0.0456L + 0.1767B \\
y_3 &= -0.0291L - 0.0369B + 0.2079H' \\
y_4 &= 0.0010L - 0.0854B - 0.0131H' + 0.2536G'H \\
y_5 &= -0.0346L - 0.0176B + 0.0094H' - 0.0174G'H + 0.1691GB \\
y_6 &= -0.0503L + 0.0209B - 0.0409H' - 0.2202G'H - 0.0219GB + 0.4796\,NH,L
\end{aligned}$$
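The transformation is easy to check numerically. The sketch below applies the six linear functions to the measurements of the Highdown skull and to the two series means taken from the table of mean values; the list layout and the function name `transform` are choices made for the illustration.

```python
# Coefficients of y1..y6 on (L, B, H', G'H, GB, NH,L), as given above.
COEFFS = [
    [0.1548, 0.0, 0.0, 0.0, 0.0, 0.0],
    [-0.0456, 0.1767, 0.0, 0.0, 0.0, 0.0],
    [-0.0291, -0.0369, 0.2079, 0.0, 0.0, 0.0],
    [0.0010, -0.0854, -0.0131, 0.2536, 0.0, 0.0],
    [-0.0346, -0.0176, 0.0094, -0.0174, 0.1691, 0.0],
    [-0.0503, 0.0209, -0.0409, -0.2202, -0.0219, 0.4796],
]

# Measurements in the order L, B, H', G'H, GB, NH,L.
HIGHDOWN = [198.2, 148.1, 142.0, 72.4, 95.8, 48.2]
BRONZE = [184.5, 149.9, 134.9, 69.1, 98.0, 49.1]
MAIDEN = [188.6, 140.8, 137.1, 72.4, 95.2, 51.9]

def transform(x):
    """Apply the six linear functions to one set of measurements."""
    return [sum(c * xi for c, xi in zip(row, x)) for row in COEFFS]

for name, x in [("Highdown", HIGHDOWN), ("Bronze Age", BRONZE),
                ("Maiden Castle", MAIDEN)]:
    print(name, [round(v, 3) for v in transform(x)])
```

Because the coefficients are printed to four decimals, the computed values can differ from tabulated ones in the last decimal place.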


TABLE 8b.4γ. Values of Transformed Characters

       Highdown   Bronze Age   Maiden Castle   Difference                     Probability
       Skull      Mean         Mean            (2) - (3)     D²      D/2      of Error
       (1)        (2)          (3)             (4)           (5)     (6)      (7)

y1     30.681     28.561       29.195          -0.634        0.402   0.317    0.375
y2     17.131     18.074       16.279           1.795        3.624   0.952    0.171
y3     18.289     17.145       17.819          -0.674        4.078   1.009    0.156
y4      4.051      3.140        4.729          -1.589        6.603   1.285    0.099
y5      6.810      7.615        7.124           0.491        6.844   1.307    0.095
y6     -7.606     -5.479       -5.287          -0.192        6.881   1.312    0.095


To test whether the Highdown skull belongs to the Bronze Age we
calculate the sum of squares of differences:

$$\chi^2 = (30.681 - 28.561)^2 + (17.131 - 18.074)^2 + \cdots + (-7.606 + 5.479)^2 = 12.694$$

which can be used as $\chi^2$ with 6 degrees of freedom if the Bronze Age
means and dispersion matrix have been obtained from a large sample.
On the other hand, if the Bronze Age ... can be used as a variance ratio
with $p$ and $(f + 1 - p)$ degrees of freedom, $p$ being the number of
characters. For the purposes of the above
example we shall treat the mean values and dispersion matrix as known
so that the $\chi^2$ test can be used. The value 12.694 is just significant on
the 5% level.
Similarly, the $\chi^2$ for the Highdown skull and the Maiden Castle
series is

$$(1.486)^2 + (0.852)^2 + \cdots + (-2.319)^2 = 9.091$$

which has a probability of more than 15%. By this criterion the Highdown
skull could be assigned to the Maiden Castle series. Since $\chi^2$
is only just significant in the other case, we might construct the
discriminant function and decide the issue.
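The two sums of squares can be reproduced directly from the transformed values. In the sketch below the 5% point 12.592 of $\chi^2$ with 6 degrees of freedom is the standard tabulated value, supplied here as an assumption since it is not quoted in the text; `sum_sq_diff` is a hypothetical helper name.

```python
# Transformed values for the skull and the two series means (y1..y6).
highdown = [30.681, 17.131, 18.289, 4.051, 6.810, -7.606]
bronze = [28.561, 18.074, 17.145, 3.140, 7.615, -5.479]
maiden = [29.195, 16.279, 17.819, 4.729, 7.124, -5.287]

def sum_sq_diff(x, mean):
    """Sum of squared differences between two sets of transformed values."""
    return sum((a - b) ** 2 for a, b in zip(x, mean))

chi2_bronze = sum_sq_diff(highdown, bronze)
chi2_maiden = sum_sq_diff(highdown, maiden)

# Standard tabulated 5% point of chi-square with 6 degrees of freedom
# (an outside value, not taken from the text).
CRITICAL_5PCT_6DF = 12.592

print(round(chi2_bronze, 3), chi2_bronze > CRITICAL_5PCT_6DF)
print(round(chi2_maiden, 3), chi2_maiden > CRITICAL_5PCT_6DF)
```

Since the transformed variables are uncorrelated with unit variances, the plain sum of squares plays the role of the quadratic form in the original measurements.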

Since all $y$ are uncorrelated, we can very easily construct the
discriminant function. For instance, the one based on the first three
values of $y$ is

$$-0.634y_1 + 1.795y_2 - 0.674y_3$$

where the coefficients are the differences in mean values of $y$ for the
Bronze Age and the Maiden Castle series. The discriminant function
based on $y_1, y_2, y_3, y_4$ is obtained by adding $-1.589y_4$ to the above
expression, and so on. The discriminant function with all the characters
is

$$-0.634y_1 + 1.795y_2 - 0.674y_3 - 1.589y_4 + 0.491y_5 - 0.192y_6$$

which has the mean values

$$-4.300 \quad \text{and} \quad 2.581$$

for the Maiden Castle and the Bronze Age series, with the middle value
$-0.859$. Suppose we follow the procedure of assigning all individuals
with values of the discriminant function above $-0.859$ to the Bronze
Age, and all others to the Maiden Castle series. Then the error in
classification corresponds to the area above $-0.859$ for a normal
distribution with mean $-4.300$ and variance

$$D^2 = (-0.634)^2 + (1.795)^2 + \cdots + (-0.192)^2 = 6.881$$

which is the sum of squares of the discriminant function coefficients.
The normal deviate with zero mean and unit standard deviation is

$$\frac{4.300 - 0.859}{\sqrt{6.881}} = \frac{3.441}{\sqrt{6.881}} = \frac{\sqrt{6.881}}{2} = \frac{D}{2} = 1.312$$

with a probability of about 0.095, which is the error of wrong classification
associated with any group. The value of the discriminant function
for the Highdown skull is $-2.661$ which assigns him to the Maiden
Castle series.
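The whole calculation can be retraced in a short sketch: the coefficients are the differences of the transformed means, the cut-off is the midpoint of the two mean scores, and the error is a normal tail area. The helper names `score` and `upper_tail` are hypothetical, and the tail area is obtained from `math.erfc` rather than from tables.

```python
import math

# Transformed values y1..y6 for the skull and the two series means.
highdown = [30.681, 17.131, 18.289, 4.051, 6.810, -7.606]
bronze = [28.561, 18.074, 17.145, 3.140, 7.615, -5.479]
maiden = [29.195, 16.279, 17.819, 4.729, 7.124, -5.287]

# Coefficients: differences in mean values, Bronze Age minus Maiden Castle.
coeffs = [b - m for b, m in zip(bronze, maiden)]

def score(y):
    """Value of the discriminant function for transformed values y."""
    return sum(c * v for c, v in zip(coeffs, y))

def upper_tail(z):
    """Area of the standard normal distribution above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mean_bronze = score(bronze)
mean_maiden = score(maiden)
midpoint = (mean_bronze + mean_maiden) / 2.0
d_squared = sum(c * c for c in coeffs)       # variance of the function
error = upper_tail(math.sqrt(d_squared) / 2.0)

skull = score(highdown)
assigned = "Bronze Age" if skull > midpoint else "Maiden Castle"
print(round(mean_bronze, 3), round(mean_maiden, 3), round(midpoint, 3))
print(round(d_squared, 3), round(error, 3), round(skull, 3), assigned)
```

The difference between the two mean scores equals the sum of squares of the coefficients, which is why the deviate at the midpoint reduces to $D/2$.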

In Table 8b.4γ the probability of error in using the first $i$ = 1, 2, ...,
6 characters is given in column (7). It is seen that the probability of
error decreases with the increase in the number of characters, although
such a decrease is inappreciable in some cases. For instance, the addition
of the last character NH,L leading to the calculation of $y_6$ does not
add much, the decrease in error being very small, less than 2 in 1000 or so.

The evaluation of the discriminant function coefficients and the tests
of significance used above are easily carried out with the use of transformed
variates. An alternative way is to adopt the computational
scheme of Table 7b.6β in Chapter 7 where successive values of $D^2$ and
the discriminant functions were obtained. The variance of the discriminant
function (treated as a linear combination of the measurements) is
$D^2$. This is useful in computing the frequencies of wrong classification.


8b.5 The Doubtful Region 


In using the discriminant function in the example of 8b.4, the critical
value separating the individuals of the two groups was obtained as the
middle value of the mean values of the function for the two groups.
The frequency of wrong classification for any group is 9.5%. Suppose
that an individual is drawn at random from two such populations
(considered above) mixed in the proportions $\pi_1 : \pi_2$. The probability of
wrong assignment of an individual in such a case is

$$\pi_1(0.095) + \pi_2(0.095) = 0.095$$

which is the same as that for a single group. This is not, however, the
minimum possible error when $\pi_1$ and $\pi_2$ are known, the minimum being
associated with the section

$$\lambda = \frac{L_1 + L_2}{2} + \log_e \pi_2 - \log_e \pi_1 \tag{8b.5.1}$$

where $L_1$ and $L_2$ are the mean values of the discriminant function $L$ for
the two groups, the higher value being associated with the first group.
If $L > \lambda$, then the individual is assigned to the first group; otherwise
to the second. Suppose that $\pi_1 : \pi_2 :: 1 : 2$, then

$$\lambda = -0.859 + 0.693 = -0.166$$


in which case the errors of wrong classifications for the two groups are

$$\alpha_1 = 0.15 \quad \text{and} \quad \alpha_2 = 0.05$$

giving a total error

$$\pi_1 \alpha_1 + \pi_2 \alpha_2 = \frac{(0.15 + 0.10)}{3} = 0.08 \text{ nearly}$$

Although the error is nearly 8%, the error of misclassification for an
individual of the first group is as high as 15%, so that an individual
assigned to the second group cannot be asserted to belong to the second
group since his chances of belonging to the first group are as high as 15%.
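The shifted section and the resulting errors can be recomputed as a check. The sketch below takes the mean scores 2.581 and $-4.300$ and the variance 6.881 from the example of 8b.4, with the Bronze Age series as the first group; the helper `upper_tail` is a hypothetical name, and the unrounded errors need not match the two-decimal figures quoted above exactly.

```python
import math

def upper_tail(z):
    """Area of the standard normal distribution above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Mean scores of the discriminant function: first group (higher mean)
# and second group, with the common standard deviation D.
L1, L2 = 2.581, -4.300
D = math.sqrt(6.881)
pi1, pi2 = 1.0 / 3.0, 2.0 / 3.0          # proportions 1:2

lam = (L1 + L2) / 2.0 + math.log(pi2) - math.log(pi1)

alpha1 = upper_tail((L1 - lam) / D)      # first-group score falls below lam
alpha2 = upper_tail((lam - L2) / D)      # second-group score falls above lam
total = pi1 * alpha1 + pi2 * alpha2

print(round(lam, 3), round(alpha1, 3), round(alpha2, 3), round(total, 3))
```

Moving the section toward the less frequent group lowers the overall error below the 9.5% obtained with the midpoint, at the cost of a larger error for individuals of the first group.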
Consider another situation where a doctor wants to discriminate
between two types of neurotic, psychopaths and obsessionals, on the
basis of some tests. If the test scores of properly diagnosed neurotics
are available, then, assuming that the ratio of the two types of patient
admitted into the hospital in the past represents the ratio in the general
population, the doctor can set up the criterion (8b.5.1). By following
this procedure he can minimize the number of cases of wrong diagnosis.
But in problems like this the groups overlap to a large extent so that
even by following the best procedure the percentage of wrong classifications
remains quite high. By increasing the number of characters this
percentage could be made smaller and smaller but not always below an
irreducible minimum because of the correlations between the characters.
Furthermore, a stage may be reached at which the cost involved in
further examination will not be commensurate to the reduction in the
number of wrong classifications. But, subject to a given cost, the
indicators * can be chosen so as to minimize the number of wrong
classifications. Thus one has to balance between the errors committed and
the time or money available.

By following this procedure it may be difficult to assert that an
individual belongs to one group or the other unless the groups are well
separated, in which case the proportion of wrong classifications will be
low. On the other hand, one may take the view that, whatever may be
the basis of judgment, in some cases it should be possible to give a
decisive answer (subject to a small risk) whereas in others no decision or
only provisional decisions can be made. The latter group comprises the
doubtful cases which need further examination.

* For instance, there are two types of jaundice which are difficult to distinguish.
One calls for a surgical treatment; the other for medical treatment. A discriminant
function based on two biochemical tests is used in practice to ensure a greater
certainty of diagnosis for far less laboratory work.




Cases also arise where the question asked is whether a selected indi- 
vidual can be asserted to belong to one particular group out of a given 
number of possibilities. Consider the problem of the Highdown skull. 
The grave findings associated with the skull excavated from the “invasion
horizon” do not give any conclusive evidence as to whether the skull 
belonged to a Bronze Age “defender” or to an Iron Age “invader.” It 
may or may not be possible to give a definite answer in such a problem. 
The case has to be judged on its individual merits, with consideration 
given to the probability of the individual’s having come from one group 
or the other. 

Sitting on the fence is a scientific attitude if it means looking for 
further evidence and better methods of judgment to be able to give a 
definite answer. 

Consider a doctor who has a routine method of diagnosing a disease 
or discriminating among a number of diseases. Although by following 
this procedure he commits the least possible errors, he would like to be 
more confident about his diagnosis in some selected cases. If the rou- 
tine method does not give him sufficient assurance, he may supplement 
it by further tests. 

For any specially chosen case like this or for an individual find such 
as the Highdown skull, the rule of procedure suggested should neces- 
sarily be independent of the a priori probabilities used in the general 
problem of discrimination. First, such a priori probabilities may not 
be available; in the case of the Highdown skull it is not possible to know 
the proportions of the Bronze and the Iron Age cranial population. 
Second, even if such knowledge is available from previous experience, 
this is not strictly applicable in a case not chosen at random from a 
mixed population. For instance, the proportions applicable to the 
Highdown skull may depend on the numbers of Bronze and Iron Age 
warriors who went down fighting and not on the general proportion. 

Thus a problem involving only one individual must be distinguished 
from a problem in which a number of individuals have to be classified 
into a given number of groups by means of suitable criteria. The latter
supplies a provisional answer to the former, but for definite answers 
suitable criteria have to be developed. 


Let us suppose that for the best solution of assigning individuals to 
the first group if 

$$\pi_1 f_1(x \mid \theta_1) \geq \pi_2 f_2(x \mid \theta_2)$$

and to the second group if 

$$\pi_2 f_2(x \mid \theta_2) > \pi_1 f_1(x \mid \theta_1)$$




the expected proportions of wrongly classified individuals of the first 
and second groups are $\alpha_1$ and $\alpha_2$, respectively. If $\alpha_1$ and $\alpha_2$ are small, 
then we can assert for any given individual that he is rightly classified. 
Otherwise we may follow the procedure of assigning an individual to 
the first group if 

$$f_1(x \mid \theta_1) \geq A f_2(x \mid \theta_2)$$

to the second if 

$$f_1(x \mid \theta_1) \leq B f_2(x \mid \theta_2)$$

and remain in doubt if 

$$A f_2(x \mid \theta_2) > f_1(x \mid \theta_1) > B f_2(x \mid \theta_2)$$

The quantities $A$ and $B$ are chosen such that the probabilities of 
wrong decisions are at assigned levels. The diagram below (Figure 1) 
shows the nature of the decisions that could be made after ascertaining 
the value of the ratio or its logarithm. 


        R2      |    D2    |    D1    |      R1
  --------------+----------+----------+-------------->  log (f1/f2)
                B'         C'         A'

$B' = \log_e B, \qquad A' = \log_e A, \qquad C' = \log_e \pi_2 - \log_e \pi_1$

FIGURE 1 


In the region $R_2$ the individual can be asserted (at a given risk) to 
belong to the second group; in $D_2$ he can be provisionally assigned to 
the second group; and similarly for $R_1$ and $D_1$. In doubtful cases it 
may be possible to measure more characters and thus bring in further 
evidence to decide the issue. 
In the example of the Highdown skull with $\pi_1 = \frac{1}{3}$ and $\pi_2 = \frac{2}{3}$, it 
has already been shown that the point of section is -0.166 so that if 
the discriminant value exceeds -0.166 the individual is assigned to the 
Bronze Age. The point corresponding to the 5% level of errors for the 
Maiden Castle series is 

$$-4.300 + 1.645D = 0.016, \qquad D = 2.624$$

so that unless the discriminant value exceeds 0.016 an individual cannot 
be asserted to belong to the Bronze Age, although provisionally he will 
be put in the Bronze Age as soon as the value exceeds -0.166.
Similarly, the 5% value for the Bronze Age is 

$$2.581 - 1.645D = -1.735$$




and unless the value of the discriminant function is below -1.735 the 
individual cannot be asserted to belong to the Maiden Castle series. 
The value for the Highdown skull is -2.661 so that he can be
confidently assigned to the Maiden Castle series. 
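
The cut-off points in this example can be recomputed from the constants quoted in the text. The sketch below assumes, as the expressions -4.300 + 1.645D and 2.581 - 1.645D imply, that the discriminant value is normally distributed with standard deviation D = 2.624 and means -4.300 (Maiden Castle) and 2.581 (Bronze Age). It also assumes a priori proportions of 1/3 for the Bronze Age and 2/3 for the Maiden Castle series, which are the values that reproduce the quoted point of section. The function and variable names are illustrative only.

```python
import math

# Constants quoted in the text; the names are illustrative, not from the source.
MEAN_MAIDEN_CASTLE = -4.300   # mean discriminant value, Maiden Castle series
MEAN_BRONZE = 2.581           # mean discriminant value, Bronze Age series
D = 2.624                     # standard deviation of the discriminant value
Z_5_PERCENT = 1.645           # one-sided 5% point of the standard normal


def point_of_section(prior_bronze, prior_maiden):
    """Value at which prior times normal density is equal for the two groups."""
    midpoint = (MEAN_BRONZE + MEAN_MAIDEN_CASTLE) / 2
    scale = D ** 2 / (MEAN_BRONZE - MEAN_MAIDEN_CASTLE)
    return midpoint + scale * math.log(prior_maiden / prior_bronze)


def decide(value, prior_bronze=1 / 3, prior_maiden=2 / 3):
    """Return an asserted or a provisional assignment for one discriminant value."""
    assert_bronze = MEAN_MAIDEN_CASTLE + Z_5_PERCENT * D   # about 0.016
    assert_maiden = MEAN_BRONZE - Z_5_PERCENT * D          # about -1.735
    if value > assert_bronze:
        return "Bronze Age (asserted)"
    if value < assert_maiden:
        return "Maiden Castle (asserted)"
    if value > point_of_section(prior_bronze, prior_maiden):
        return "Bronze Age (provisional)"
    return "Maiden Castle (provisional)"


print(round(point_of_section(1 / 3, 2 / 3), 3))
print(decide(-2.661))
```

Values between the two 5% points fall in the doubtful region, where only a provisional assignment is returned.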


8b.6 Resolution of a Mixed Series into Two Gaussian Components 


In the previous sections were considered problems of determining the 
group to which an individual belongs when the distributions in the al- 
ternative groups are known. There may arise cases where a collection 
of individuals is observed but no information is available as to the dis- 
tributions in the groups from which they have arisen or the proportion 
of mixture. The general problem is then to determine the character- 
istics of the various groups and also the proportion of mixture from the 
available set of measurements. This information may be finally used 
to specify the group of each individual, if necessary. 

Considering only two groups in which a certain character is
distributed normally, the statistical problem reduces to that of estimating 
from the observed frequency distribution the two mean values $\mu_1$, $\mu_2$, 
standard deviations $\sigma_1$, $\sigma_2$, and the proportion of mixture $\pi$. The
estimation of these five parameters by the method of moments was
discussed by Pearson (1894). The estimates depend on a suitably chosen 
root of a nonic (ninth-degree equation) constructed from the first five 
moments of the observed frequency distribution. 

In many problems it is reasonable to suppose that $\sigma_1 = \sigma_2$, in which 
case there are only four parameters to be estimated. If the method of 
moments is followed, the first four moments are sufficient, for it has been 
shown that the estimates depend on the negative root of a simple cubic 
equation constructed from the first four moments. In practice, where 
large numbers are involved, the estimates obtained by the method of 
moments, though not efficient, may serve the purpose at hand. Where 
higher efficiency is aimed at, the estimates have to be found by the 
method of maximum likelihood. The numerical computations involved 
in this method are very complicated. 

Whatever the method of estimation employed, the numerical compu- 
tations become much simpler when the standard deviations are assumed 
to be equal. The simplifying assumption may introduce bias in the 
estimates when, in fact, the standard deviations differ. Such estimates 
are, however, more accurate than those obtained without this assumption
when the bias in any estimate is smaller than its standard error. 
If the mean values and the proportion of mixture are to be estimated 
with a higher precision, small differences in the standard deviations can 
be ignored. 




Estimation by the Method of Moments. The rule of estimating the 
parameters by the method of moments consists in equating the moments 
as calculated from the observations to functions of parameters repre- 
senting the moments in the population. Since the expectations of cal- 
culated moments are not the same as the moments in the population, 
this method might introduce a little bias in the estimating equations. 
This bias can be avoided by equating the calculated moments to their 
expected values. Instead of this, we can choose the system of k-statistics 
of Fisher (defined in Statistical Methods for Research Workers) and equate 
them to their expectations which are the cumulants of the distribution. 
If $s_2$, $s_3$, and $s_4$ are the second, third, and fourth moments about the 
mean, and $s_1$ the first moment about the origin, as calculated by the 
usual method, the first four k-statistics derivable from them are given by 

$$\begin{aligned}
k_1 &= s_1 \\
k_2 &= \frac{n}{n-1}\, s_2 \\
k_3 &= \frac{n^2}{(n-1)(n-2)}\, s_3 \\
k_4 &= \frac{n^2}{(n-1)(n-2)(n-3)} \{(n+1)s_4 - 3(n-1)s_2^2\}
\end{aligned}$$


If the moments are calculated from grouped data with class interval $h$, 
the quantities $\frac{1}{12}h^2$ and $-\frac{1}{120}h^4$ have to be subtracted from the
expressions for $k_2$ and $k_4$, respectively. 
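
The formulas above can be sketched in a few lines. The example below is an illustration, not the author's computing scheme: the function name `k_statistics` is hypothetical, and the data are the plant-height frequencies of Table 8b.6a in this section, measured from 14 as the origin with class interval h = 1.

```python
def k_statistics(values, freqs, h=0.0):
    """First four k-statistics from a frequency table, adjusted for grouping."""
    n = sum(freqs)
    s1 = sum(f * v for v, f in zip(values, freqs)) / n
    # Second, third, and fourth moments about the mean s1.
    s2, s3, s4 = (
        sum(f * (v - s1) ** r for v, f in zip(values, freqs)) / n
        for r in (2, 3, 4)
    )
    k1 = s1
    k2 = n * s2 / (n - 1)
    k3 = n ** 2 * s3 / ((n - 1) * (n - 2))
    k4 = n ** 2 * ((n + 1) * s4 - 3 * (n - 1) * s2 ** 2) / (
        (n - 1) * (n - 2) * (n - 3)
    )
    # Grouping adjustment: subtract h^2/12 from k2 and -h^4/120 from k4.
    return k1, k2 - h ** 2 / 12, k3, k4 + h ** 4 / 120


# Class mid-values 8, 9, ..., 20 cm of Table 8b.6a, taken about 14 as origin.
deviations = [mid - 14 for mid in range(8, 21)]
frequencies = [3, 9, 21, 40, 59, 76, 79, 69, 46, 30, 13, 7, 2]

k1, k2, k3, k4 = k_statistics(deviations, frequencies, h=1.0)
print(k1, k2, k3, k4)
```

The printed values should agree closely with the adjusted cumulants quoted for this table in the worked example.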
If $p$, $m_1$, $m_2$, and $s$ denote the estimates of $\pi$, $\mu_1$, $\mu_2$, and $\sigma$, the
common standard deviation, then the estimating equations by this method 
are 

$$\begin{aligned}
1 &= p + q \\
k_1 &= p m_1 + q m_2 \\
k_2 &= s^2 + p d_1^2 + q d_2^2 \\
k_3 &= p d_1^3 + q d_2^3 \\
k_4 &= p d_1^4 + q d_2^4 - 3(p d_1^2 + q d_2^2)^2
\end{aligned}$$

where $q = 1 - p$, $d_1 = m_1 - k_1$, and $d_2 = m_2 - k_1$. From the
definition $x = d_1 d_2$, the value of $x$ is obtained as the negative root of the cubic 

$$x^3 + \tfrac{1}{2} k_4 x + \tfrac{1}{2} k_3^2 = 0$$
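
The step from the estimating equations to the cubic is not spelled out in the text. One way to fill it in, using only the relations $p + q = 1$, $d_1 = m_1 - k_1$, $d_2 = m_2 - k_1$, and $x = d_1 d_2$, is sketched here. The equation for $k_1$ gives $p d_1 + q d_2 = 0$, so that $p = d_2/(d_2 - d_1)$ and $q = -d_1/(d_2 - d_1)$, and therefore

$$\begin{aligned}
p d_1^2 + q d_2^2 &= -d_1 d_2 = -x \\
p d_1^3 + q d_2^3 &= -x(d_1 + d_2) \\
p d_1^4 + q d_2^4 &= -x\{(d_1 + d_2)^2 - x\}
\end{aligned}$$

Substituting in the equations for $k_2$, $k_3$, and $k_4$,

$$k_2 = s^2 - x, \qquad d_1 + d_2 = -\frac{k_3}{x}, \qquad k_4 = -\frac{k_3^2}{x} - 2x^2$$

and multiplying the last relation by $x/2$ gives $x^3 + \tfrac{1}{2} k_4 x + \tfrac{1}{2} k_3^2 = 0$. Since $p$ and $q$ are both positive, $d_1$ and $d_2$ have opposite signs, which is why the negative root is the one required.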




If $x$ is the required root, then $d_1$ is given by the negative root of the 
quadratic 

$$d_1^2 + \frac{k_3}{x}\, d_1 + x = 0$$

and $d_2$ by $-(k_3/x) - d_1$. The estimates $m_1$, $m_2$, $p$, and $s$ are given by 

$$m_1 = k_1 + d_1, \qquad m_2 = k_1 + d_2, \qquad p = \frac{d_2}{d_2 - d_1}, \qquad s^2 = k_2 + x$$

The fundamental cubic equation 

$$x^3 + \tfrac{1}{2} k_4 x + \tfrac{1}{2} k_3^2 = 0$$

introduced above has a single negative root greater than $-k_2$. The 
best method of determining the root is to start with a trial value and 
obtain the correction by Newton’s method of approximation. Since 
the equation is in a reduced form with the coefficient of $x^2$ absent, it 
is easy to guess the root correct to the nearest integer. If $x_1$ stands for 
the trial value, then the additive correction $\delta x_1$ is given by 

$$\left[3x_1^2 + \tfrac{1}{2} k_4\right] \delta x_1 = -x_1^3 - \tfrac{1}{2} k_4 x_1 - \tfrac{1}{2} k_3^2$$


The process is repeated until the expression on the right-hand side
becomes very small. The data of Table 8b.6a give the frequency
distribution of heights of 454 plants of two different types grown on the same 
plot. The plants are indistinguishable except at the flowering stage. 
The problem is to estimate the mean height of the two types of plant, 
their common standard deviation, and the proportion of mixture. 

TABLE 8b.6a. The Frequency Distribution of Height in Centimeters for 454 Plants 

Class Interval    Frequency 
  7.5- 8.5             3 
  8.5- 9.5             9 
  9.5-10.5            21 
 10.5-11.5            40 
 11.5-12.5            59 
 12.5-13.5            76 
 13.5-14.5            79 
 14.5-15.5            69 
 15.5-16.5            46 
 16.5-17.5            30 
 17.5-18.5            13 
 18.5-19.5             7 
 19.5-20.5             2 

 Total               454 

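The values of cumulants after adjustment for grouping are 

$$\begin{aligned}
k_1 &= -0.244493 \quad \text{(about 14 as the origin)} \\
k_2 &= 4.975963 \\
k_3 &= 0.728751 \\
k_4 &= -5.314741 \\
\tfrac{1}{2} k_4 &= -2.657370, \qquad \tfrac{1}{2} k_3^2 = 0.265539
\end{aligned}$$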


The fundamental cubic is 

$$x^3 - 2.657370x + 0.265539 = 0$$

Taking -1.65 as a trial root, we find the correction $\delta x$ is given by 

$$\begin{aligned}
[3(1.65)^2 - 2.657370]\, \delta x &= -(-1.65)^3 + 2.657370(-1.65) - 0.265539 \\
5.510130\, \delta x &= -0.158074 \\
\delta x &= -0.0286878
\end{aligned}$$

Similarly, the second correction is 0.000707 so that the second
approximation is $-1.678688 + 0.000707 = -1.677981$. The quadratic
giving $d_1$ is 

$$d_1^2 - 0.434302\, d_1 - 1.677981 = 0$$

which yields the negative root 

$$\begin{aligned}
d_1 &= -1.096293 \\
d_2 &= 1.096293 + 0.434302 = 1.530595
\end{aligned}$$

The estimates of $m_1$ and $m_2$ about 14 as the origin are 

$$\begin{aligned}
m_1 &= -1.096293 - 0.244493 = -1.340786 \\
m_2 &= 1.530595 - 0.244493 = 1.286102 \\
p &= \frac{d_2}{d_2 - d_1} = 0.582665 \\
q &= (1 - p) = 0.417335 \\
s^2 &= 4.975963 - 1.677981 = 3.297982 \\
s &= 1.816035
\end{aligned}$$




This completes the estimation of the four parameters by the method of 
moments. 
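
The whole calculation can be repeated mechanically. The following is a minimal sketch, assuming the adjusted cumulants quoted in the text as input; the function names are hypothetical, and Newton's correction is iterated to convergence rather than stopped after two steps as in the hand computation.

```python
import math


def negative_root(k3, k4, trial, tol=1e-12, max_iter=50):
    """Newton's method for the negative root of x^3 + (k4/2) x + k3^2/2 = 0."""
    x = trial
    for _ in range(max_iter):
        value = x ** 3 + 0.5 * k4 * x + 0.5 * k3 ** 2
        correction = -value / (3 * x ** 2 + 0.5 * k4)
        x += correction
        if abs(correction) < tol:
            break
    return x


def resolve_two_components(k1, k2, k3, k4, trial):
    """Moment estimates for a mixture of two normals with equal variance."""
    x = negative_root(k3, k4, trial)
    b = k3 / x
    d1 = (-b - math.sqrt(b * b - 4 * x)) / 2   # negative root of d^2 + b d + x = 0
    d2 = -b - d1
    p = d2 / (d2 - d1)
    return {"m1": k1 + d1, "m2": k1 + d2, "p": p, "q": 1 - p,
            "s": math.sqrt(k2 + x)}


# Adjusted cumulants of the plant-height data (k1 about 14 as the origin).
estimates = resolve_two_components(
    k1=-0.244493, k2=4.975963, k3=0.728751, k4=-5.314741, trial=-1.65
)
print(estimates)
```

The means are returned on the same origin as `k1`, so 14 has to be added to express them in centimeters.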

The expressions for standard errors of these estimates are very com- 
plicated, but it appears that the estimate of the proportion of mixture 
will have the highest percentage of error whereas the estimates of 
means and standard deviation will be fairly reliable in large samples. 

A good deal of caution is needed in resolving a mixed series. 

(i) It must be ascertained that the population is a mixture of two 
homogeneous groups only. 

(ii) Departure of the individual distributions from the Gaussian type 
introduces serious errors in the estimates. 


(iii) Samples must be large enough for a successful resolution into 
two components. 


8b.7 Sexing of Osteometric Material 


In anthropometric work the problem of sexing often arises. The sex 
of an excavated skull or jawbone can be ascertained with a high chance 
of success if the associated pelvis is also found. Sometimes the sex of 
the skeleton may be determined by external evidence such as beads and 
other ornaments found buried with the body. Anthropologists of the 
“trained eye” school claim that a skull can be properly sexed by ana- 
tomical appreciation. In such cases, the conclusions may be marred 
by a subjective bias unless the features of the specimen examined are 
so striking as to leave the question of its identity in little doubt. 

In any case two types of situations have to be faced. Some skeletal 
remains can be definitely sexed, but how can the other bones belonging 
to the same series be sexed? All those bones which have been sexed 
supply the basic material with which suitable discriminant functions 
can be constructed. The sex proportion can also be ascertained from 
the basic material. The problem, then, reduces to the simplest one of 
discrimination between two groups when the individual distributions 
and the a priori probabilities are known. 

It may be necessary to obtain discriminant functions based on dif- 
ferent sets and subsets of characters to suit the various bones which 
have to be sexed. For instance, a skull may be in a broken condition 
so that only the length and breadth of the cranium can be measured. 
A decision is then made with the help of a discriminant function based 
on length and breadth of the cranium only. Some skulls may admit 
facial and nasal measurements as well, in which case the discriminant 
function based on all these measurements has to be used. The method 
of transformed characters and the construction of discriminant
functions adopted in 8b.5 are very useful in this connection, provided that 




the order in which the characters are added is suitably determined. 
For instance, consider the five measurements of the cranium: length 
(L), breadth (B), frontal breadth (B’), height (H’), and circumference 
(S). The order here is obviously L, B, B’, H’, and S, for there may be 
some skulls providing measurements on L, B, and B’ alone and not on 
the rest, whereas skulls admitting the measurements of H’ and S must 
necessarily supply the measurements L, B, and B’. 

We have another situation when adequate material is not available 
for the construction of the discriminant function. Then an approximate 
function can be tried, the simplest of which is of the type 


$$\pm \frac{x_1}{\sigma_1} \pm \frac{x_2}{\sigma_2} \pm \cdots$$

where + or - is chosen according as the male mean for $x_i$ is greater or 
smaller than the female mean. The values of the standard deviations 
$\sigma_1, \sigma_2, \ldots$ of $x_1, x_2, \ldots$ need not be known exactly. Values from any 
related series can be used because what is important is the relative order 
of the various standard deviations. This formula does not make use of 
the actual mean values but only of their inequality relationship. We 
shall term the above expression as the “general size factor.” 

In the problem of sexing, the size factor can be conveniently employed 
because the inequalities relating to various measurements of the male 
and female are known. Most of the linear measurements have higher 
values for males, whereas angles and some indices have higher values 
for females. The series of values of the size factor calculated for each 
specimen to be sexed can first be arranged in decreasing order of mag- 
nitude and then divided in a given sex ratio assigning the higher values 
to males and the smaller ones to females. 
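
The procedure of the last two paragraphs can be sketched as follows. The measurements, standard deviations, and sex ratio in this example are invented for illustration and do not come from any series discussed in the text; the function names are likewise hypothetical.

```python
def size_factor(measurements, sds, signs):
    """General size factor: signed sum of measurements divided by their s.d."""
    return sum(sign * x / sd for x, sd, sign in zip(measurements, sds, signs))


def split_by_sex_ratio(factors, male_fraction):
    """Label the specimens with the highest size factors male, the rest female."""
    n_males = round(male_fraction * len(factors))
    order = sorted(range(len(factors)), key=lambda i: factors[i], reverse=True)
    labels = ["female"] * len(factors)
    for i in order[:n_males]:
        labels[i] = "male"
    return labels


# Hypothetical specimens: two lengths (larger in males, sign +1) and one
# index (larger in females, sign -1). Standard deviations are also made up.
specimens = [
    (185.0, 140.0, 78.0),
    (172.0, 133.0, 80.0),
    (190.0, 143.0, 76.0),
    (170.0, 131.0, 81.0),
]
sds = (6.0, 5.0, 3.0)
signs = (1, 1, -1)

factors = [size_factor(s, sds, signs) for s in specimens]
labels = split_by_sex_ratio(factors, male_fraction=0.5)
print(labels)
```

Only the signs and the rough relative sizes of the standard deviations enter the factor, which is why values borrowed from a related series can serve.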

Some difficulty arises when the sex ratio is unknown. The series of 
size factor values may then be treated as in 8b.6 for resolution into two 
Gaussian components. This supplies the sex ratio and other constants 
which may be useful in setting up the best procedure for discrimination 
based on a single characteristic, viz., the size factor, or only the sex ratio 
may be used to divide the series into males and females, as suggested 
above. 

As observed earlier, the resolution into Gaussian components is not 
always a happy proposition because the estimation involves the
calculated higher moments which are subject to large standard errors. An 
alternative procedure is as follows. 

First we assume that the standard deviation of the size factor in 
a homogeneous group (either ♂♂ or ♀♀) can be obtained from any 




related series. If $k_2$, $k_3$ are the second and third cumulants of the mixed 
series of the size factor, then the estimating equations are 

$$\begin{aligned}
p_1 d_1 + p_2 d_2 &= 0 \\
d_1 d_2 &= -(k_2 - \sigma^2) \\
d_1 + d_2 &= \frac{k_3}{k_2 - \sigma^2} \\
p_1 &= \frac{d_2}{d_2 - d_1}
\end{aligned}$$

where $d_1 = m_1 - k_1$ and $d_2 = m_2 - k_1$, as defined before. The use of 
the fourth moment is avoided by this procedure, but it remains to be 
seen what error is committed by choosing a wrong $\sigma$. Consider the 
previous example with 

$$k_1 = -0.244493, \qquad k_2 = 4.975963, \qquad k_3 = 0.728751$$

Assuming three different values $\sigma^2 = 4$, $3.297982$, $3$, the equations
and their solutions are 

    sigma^2          4            3.297982     3
    d1 d2           -0.975963    -1.677981    -1.975963
    d1 + d2          0.746699     0.434302     0.368808
    d1              -0.682753    -1.096293    -1.233329
    d2               1.429452     1.530595     1.602137
    p                0.676758     0.582665     0.565035

of which the middle one corresponds to the solution already obtained. 
It appears that $p$ is stable for small variations in $\sigma$. 
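
The three solutions can be checked directly. This sketch assumes the second and third cumulants of the plant-height example and treats the variance of a homogeneous group as given; the function name is hypothetical.

```python
import math


def resolve_with_known_variance(k1, k2, k3, sigma_sq):
    """Solve the estimating equations when sigma^2 is taken from a related series."""
    product = -(k2 - sigma_sq)        # d1 * d2
    total = k3 / (k2 - sigma_sq)      # d1 + d2
    root = math.sqrt(total ** 2 - 4 * product)
    d1 = (total - root) / 2           # the negative deviation
    d2 = (total + root) / 2
    p = d2 / (d2 - d1)
    return {"d1": d1, "d2": d2, "p": p, "m1": k1 + d1, "m2": k1 + d2}


for sigma_sq in (4.0, 3.297982, 3.0):
    solution = resolve_with_known_variance(-0.244493, 4.975963, 0.728751, sigma_sq)
    print(sigma_sq, round(solution["d1"], 6), round(solution["d2"], 6),
          round(solution["p"], 6))
```

The assumed variance must be smaller than $k_2$, otherwise the product $d_1 d_2$ is no longer negative and the equations have no admissible solution.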


Another method is to avoid the calculation of even the third moment 
but to use the median instead. The computations will be slightly 
heavier. 

If the mean values are available, then a better approximation to the 
discriminant function is 

$$k_1 x_1 + k_2 x_2 + \cdots + k_p x_p$$

where $k_i = d_i/\sigma_i^2$, $d_i$ being the difference in mean values of the $i$th 
character. This is exactly the discriminant function when the
correlations are neglected. 




Example 1. Construct the discriminant function, assuming
differences $d_1, d_2, \ldots, d_p$ in the mean values and a correlation matrix with 
all correlations equal to a constant value $r$. The function can be
expressed in terms of two linear functions of the variables 

$$P = \frac{x_1}{\sigma_1} + \frac{x_2}{\sigma_2} + \cdots + \frac{x_p}{\sigma_p}$$

and 

$$Q = k_1 x_1 + k_2 x_2 + \cdots + k_p x_p$$

where 

$$k_i = \frac{d_i}{\sigma_i^2}$$

The second factor is called the shape factor by Penrose (1947), who uses 
a slightly different form. 

Example 2. Consider a correlation matrix of the type 

$$\begin{pmatrix} A & B \\ B' & C \end{pmatrix}$$

where $A$ and $C$ are two equicorrelation matrices and $B$ is a matrix with 
all its elements constant. Here the discriminant function depends on 
two sets of size and shape factors and on another factor which may be 
called the bipolar factor. 


8b.8 The Problem of Three and More Groups 

In 8b.5 it is seen that, if measurements on a certain number of
characters are available for two groups, it is possible to construct a
discriminant function which affords the maximum discrimination between 
them. This function is useful in assigning with a certain degree of 
confidence an individual or individuals to one or the other of the two 
groups to which they are known to belong. In taxonomic problems 
there arise cases where an individual specimen is known to belong to 
one of three or more groups and has to be assigned to its proper group. 
Thus a plant may have to be specified as Iris versicolor, Iris setosa, or 
Iris virginica. This problem is approached by the extension of the
discriminant function analysis developed with special reference to two 
groups. 

The General Theory for Three Groups. Let the probability densities 
in the three groups be represented by $f_1(x, \theta_1)$, $f_2(x, \theta_2)$, $f_3(x, \theta_3)$, where 
$x$ stands for the available set of measurements and $\theta$ for the parameters. 
First we shall consider the general problem of classifying a collection 
of individuals drawn from a mixed population containing individuals of 
the three groups in the proportions $\pi_1$, $\pi_2$, $\pi_3$ ($\pi_1 + \pi_2 + \pi_3 = 1$). 




Any individual $I$ characterized by $p$ measurements can be represented 
by a point in a $p$-dimensional space. The problem of classifying an 
observed collection of individuals is the same as the division of the space 
into three mutually exclusive regions, $R_1$, $R_2$, $R_3$, with the rule of
procedure of assigning an individual $I$, represented by a point in $R_i$, to the 
$i$th group. If the probability that an individual of the $i$th group will 
fall in $R_i$ is $\beta_i$, then the expected value of the proportion of wrong
classifications is 

$$\alpha = 1 - (\pi_1 \beta_1 + \pi_2 \beta_2 + \pi_3 \beta_3)$$

The errors will be a minimum for that choice of regions $R_1$, $R_2$, $R_3$ for 
which $\pi_1 \beta_1 + \pi_2 \beta_2 + \pi_3 \beta_3$ is a maximum. Such regions, if they exist, 
may be termed the “best possible” regions. The following theorem 
establishes the existence and nature of the best possible regions. 

Theorem 1. The regions defined by 

$$\begin{aligned}
R_1 &: \quad \pi_1 f_1 \geq \pi_2 f_2, \quad \pi_1 f_1 \geq \pi_3 f_3 \\
R_2 &: \quad \pi_2 f_2 \geq \pi_1 f_1, \quad \pi_2 f_2 \geq \pi_3 f_3 \\
R_3 &: \quad \pi_3 f_3 \geq \pi_1 f_1, \quad \pi_3 f_3 \geq \pi_2 f_2
\end{aligned}$$

constitute the best possible system of mutually exclusive regions. 


The result follows from the lemma proved in Appendix A2. It is 
interesting to observe that this solution is the same as that obtained 
from Bayes’s theorem on posterior probabilities. 
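
The rule of theorem 1 amounts to assigning each individual to the group with the largest value of $\pi_i f_i$. A minimal sketch follows, assuming for illustration a single measurement that is normally distributed in each group; the means, standard deviation, and proportions are invented, and the function names are hypothetical.

```python
import math


def normal_density(x, mean, sd):
    """Density of a normal distribution with the given mean and s.d."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def assign_group(x, densities, proportions):
    """Return the 1-based index of the group with the largest pi_i * f_i(x)."""
    scores = [pi * f(x) for pi, f in zip(proportions, densities)]
    return scores.index(max(scores)) + 1


# Hypothetical groups: unit s.d., means 0, 2, and 5.
densities = [lambda x, m=m: normal_density(x, m, 1.0) for m in (0.0, 2.0, 5.0)]
proportions = (0.5, 0.3, 0.2)

for x in (1.1, 1.4, 4.0):
    print(x, assign_group(x, densities, proportions))
```

Passing equal proportions to `assign_group` gives the rule in which only the densities themselves are compared.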

If every individual is equally likely to be drawn from any group, the 
best regions are 

$$\begin{aligned}
R_1 &: \quad f_1 \geq f_2, \quad f_1 \geq f_3 \\
R_2 &: \quad f_2 \geq f_1, \quad f_2 \geq f_3 \\
R_3 &: \quad f_3 \geq f_1, \quad f_3 \geq f_2
\end{aligned}$$

These regions may be used for classifying an observed collection of 
individuals when nothing is known about $\pi_1$, $\pi_2$, $\pi_3$, the proportions of 
mixture. This is the maximum likelihood method in the problem of 
classification. We choose that hypothesis for which the likelihood is a 
maximum. 

By adopting this procedure the probability of an individual of the 
first group being rightly assigned is $\int_{R_1} f_1 \, dv = \beta_1$, and the probabilities 
of the individuals of the second and third groups being wrongly assigned 
to the first group are 

$$\alpha_{12} = \int_{R_1} f_2 \, dv \qquad \text{and} \qquad \alpha_{13} = \int_{R_1} f_3 \, dv$$
Ry Ry 


Since, in $R_1$, $f_1 \geq f_2$, $f_1 \geq f_3$, it follows that 

$$\beta_1 \geq \text{the greater of } \alpha_{12} \text{ and } \alpha_{13}$$

If $\alpha_{12}$ and $\alpha_{13}$ are small, we can assert with some confidence that an 
individual falling in $R_1$ is correctly classified. If they are not small, it 
is pertinent to inquire whether there exists a region $C_1$ such that 

$$\int_{C_1} f_1 \, dv \quad \text{is a maximum}$$

subject to the conditions that 

$$\int_{C_1} f_2 \, dv \qquad \text{and} \qquad \int_{C_1} f_3 \, dv$$

are both not greater than a quantity $\alpha_1$, chosen to be small, say 0.01 or 
0.05. If an observed specimen falls in such a region, then the hypothesis 
that it belongs to the second or the third groups may be rejected, in 
which case it is assigned to the first group. The existence and nature 
of such regions are established by theorem 2. 

Theorem 2. Region $C_1$ satisfying the condition 

$$\int_{C_1} f_1 \, dv \quad \text{is a maximum}$$

subject to the restrictions 

$$\alpha_1 \geq \int_{C_1} f_2 \, dv \qquad \text{and} \qquad \int_{C_1} f_3 \, dv$$

is defined by 

$$f_1 \geq a f_2 + b f_3$$

where $a$ and $b$ are suitably chosen. 

The proof of this theorem follows from the lemma of Neyman and 
Pearson given in Appendix A1. To apply this lemma consider two 
quantities $\alpha_{12}$, $\alpha_{13}$, both less than the assigned quantity $\alpha_1$, and choose 
a region $w$ such that 

$$\int_{w} f_1 \, dv \quad \text{is a maximum}$$

subject to the conditions 

$$\int_{w} f_2 \, dv = \alpha_{12}, \qquad \int_{w} f_3 \, dv = \alpha_{13}$$


The inside of such a region is defined by 

$$f_1 \geq a' f_2 + b' f_3$$

where $a'$ and $b'$ are properly chosen. Let the maximized value of 
$\int_{w} f_1 \, dv$, which is evidently a function of $\alpha_{12}$, $\alpha_{13}$, be represented by 
$\beta(\alpha_{12}, \alpha_{13})$. This is not, in general, an increasing function of $\alpha_{12}$ and 
$\alpha_{13}$; therefore the maximum value is not necessarily attained when 
$\alpha_{12} = \alpha_{13} = \alpha_1$. If, now, the function $\beta(\alpha_{12}, \alpha_{13})$ is maximized with 
respect to $\alpha_{12}$, $\alpha_{13}$, subject to the conditions $\alpha_{12}, \alpha_{13} \leq \alpha_1$, we obtain 
two values, $\alpha_{12}^0$, $\alpha_{13}^0$, corresponding to the optimum solution.
Denoting the values of $a'$, $b'$ corresponding to $\alpha_{12}^0$, $\alpha_{13}^0$ by $a$, $b$, the required 
region may be written $f_1 \geq a f_2 + b f_3$, which proves theorem 2. 

It is easy to see that at least one of the values $\alpha_{12}^0$, $\alpha_{13}^0$ coincides 
with the boundary value $\alpha_1$. Consider the best region corresponding to 
$\alpha_{12}$, $\alpha_{13}$ with both less than $\alpha_1$. If $\alpha_{12} > \alpha_{13}$, it is always possible to 
add a region in which $f_2 \geq f_3$ so that $\alpha_{12}$ is increased to $\alpha_1$, and $\alpha_{13}$ 
to a value $\leq \alpha_1$. If this is not possible, a region in which $f_2 < f_3$ can 
be added such that at least one value, $\alpha_{12}$ or $\alpha_{13}$, reaches the value $\alpha_1$. 
The value of $\beta(\alpha_{12}, \alpha_{13})$ is increased in any case. 

Having obtained regions $R_1$ and $C_1$ as given in theorems 1 and 2, we 
may specify an individual falling in the $C_1$ region as belonging to the 
first group, and an individual falling in $D_1 = R_1 - C_1$ as likely to
belong to the first group. Regions $R_2$, $C_2$ and $R_3$, $C_3$ can be similarly 
constructed. 

If the proportions $\pi_1$, $\pi_2$, $\pi_3$ considered in theorem 1 are known, then 
region $C_1$ is determined by 

$$f_1 \geq a(\pi_2 f_2 + \pi_3 f_3)$$

where $a$ is chosen such that $\int_{C_1} (\pi_2 f_2 + \pi_3 f_3) \, dv = \alpha_1$; and so on. The 
position is shown in Figure 2. 


A certain amount of simplification results if the best region $C_1$ is 
replaced by 

$$C_1' : \quad f_1 \geq A f_2, \quad f_1 \geq B f_3$$

where $A$ and $B$ are chosen such that 

$$\int_{C_1'} f_1 \, dv \quad \text{is a maximum}$$

subject to the conditions 

$$\int_{C_1'} f_2 \, dv \leq \alpha_1, \qquad \int_{C_1'} f_3 \, dv \leq \alpha_1$$

This region is not the best possible, but it is likely to be a good
approximation. 

In some situations it may be necessary to find regions $R_1$, $R_2$, $R_3$ such 
that the errors of classification are the same for each group or are to be 
in given ratios $\rho_1 : \rho_2 : \rho_3$. The existence and nature of such regions are 
established by the following theorem. 

FIGURE 2. The division of the space for six possible decisions. 

Theorem 3. The system of regions 

$$\begin{aligned}
R_1 &: \quad a f_1 \geq b f_2, \quad a f_1 \geq c f_3 \\
R_2 &: \quad b f_2 \geq c f_3, \quad b f_2 \geq a f_1 \\
R_3 &: \quad c f_3 \geq a f_1, \quad c f_3 \geq b f_2
\end{aligned}$$

where $a$, $b$, $c$ are suitably chosen, are the best possible if the errors of 
classification for the three groups are to be in an assigned ratio. 


Let $R_1'$, $R_2'$, $R_3'$ be any other set of regions for which the errors of 
classification are in the assigned ratio. The region common to $R_i'$ and 
$R_j$ is represented by $R_{ij}$. Then 

$$\begin{aligned}
a \int_{R_1'} f_1 \, dv &= \int_{R_{11}} a f_1 \, dv + \int_{R_{12}} a f_1 \, dv + \int_{R_{13}} a f_1 \, dv \\
&\leq a \int_{R_{11}} f_1 \, dv + b \int_{R_{12}} f_2 \, dv + c \int_{R_{13}} f_3 \, dv
\end{aligned}$$

Similar relationships can be set up, starting with $\int_{R_2'} b f_2 \, dv$ and 
$\int_{R_3'} c f_3 \, dv$. The errors of classification with respect to the systems $R_1$, 
$R_2$, $R_3$ and $R_1'$, $R_2'$, $R_3'$ may be represented by $\alpha\rho_1$, $\alpha\rho_2$, $\alpha\rho_3$ and $\alpha'\rho_1$, 
$\alpha'\rho_2$, $\alpha'\rho_3$, respectively, $\rho_1 : \rho_2 : \rho_3$ being the assigned ratio. Writing down 
the values of integrals in the above three relationships and adding, 

$$(1 - \rho_1 \alpha')a + (1 - \rho_2 \alpha')b + (1 - \rho_3 \alpha')c \leq (1 - \rho_1 \alpha)a + (1 - \rho_2 \alpha)b + (1 - \rho_3 \alpha)c$$

or 

$$-\alpha'(a\rho_1 + b\rho_2 + c\rho_3) \leq -\alpha(a\rho_1 + b\rho_2 + c\rho_3)$$

i.e., $\alpha' \geq \alpha$, since $a$, $b$, and $c$ are positive quantities by definition. This 
proves the result of theorem 3. 

The quantities $a$, $b$, $c$ are to be determined from the relations 

$$\frac{1}{\rho_1} \int_{R_2 + R_3} f_1 \, dv = \frac{1}{\rho_2} \int_{R_1 + R_3} f_2 \, dv = \frac{1}{\rho_3} \int_{R_1 + R_2} f_3 \, dv$$
Von Mises (1945) considers the problem of classification which
minimizes the maximum error. For any system of regions $R_1$, $R_2$, $R_3$, the 
errors associated with the three groups are 

$$\alpha_1 = 1 - \int_{R_1} f_1 \, dv, \qquad \alpha_2 = 1 - \int_{R_2} f_2 \, dv, \qquad \alpha_3 = 1 - \int_{R_3} f_3 \, dv$$

The set of regions for which the maximum $\alpha$ is a minimum is
recommended by Von Mises for possible use in problems of classification. It 
is first easy to see that for such regions 

$$\alpha_1 = \alpha_2 = \alpha_3$$

for, if an inequality relationship is true, say $\alpha_2 > \alpha_3$, it is possible to 
reallocate the regions $R_2$ and $R_3$ such that $\alpha_2$ is decreased and $\alpha_3$ is
increased, thus reducing the maximum $\alpha$. No improvement is possible 
by this method when $\alpha_1 = \alpha_2 = \alpha_3$, in which case we can choose the 
regions with the help of theorem 3 to minimize this common value, i.e., 
when $\rho_1 = \rho_2 = \rho_3$. 

The minimax requirement is to some extent unrealistic. Consider a 
situation where two of the three groups are close together and the other 
is quite distant. If the individuals of the distant group are considered, 
the chance of misclassification should be small whereas the chance of 
error for any one of the closer groups should be high. No compromise 
is served by equalizing these errors. As observed earlier the maximum 
likelihood solution can be used when nothing is known about the a 
priori probabilities. When these regions are used, the requirement 
stated above is automatically satisfied. Also, there does not exist any 
other set of regions which is uniformly better than this set in the sense 
that the errors are smaller for each group. 

Theorem 2 led us to the construction of 4 mutually exclusive regions 
with the help of which an observed specimen can be assigned either to 
a particular group or to none. In some problems it may be necessary 
to construct a system of 7 regions, 3 for assigning an observed specimen 
to particular groups, 3 others for specifying it as belonging to one of 
two of the groups, and the remaining one for making no decision. To 
construct these regions we set up three regions $w_1$, $w_2$, $w_3$ for not
accepting respectively the first, the second, and the third groups, as the
possible ones from which the observed specimen has arisen. The boundary 
surfaces of these regions determine by mutual intersection the required 
system of 7 regions. The region outside $w_1$, $w_2$, $w_3$ is the doubtful
region; the intersection of $w_i$ and $w_j$ is the region for specifying an
individual belonging to the $k$th group ($k \neq i \neq j$); and the region outside 
$w_i$, $w_j$ but inside $w_k$ is for either the $i$th or the $j$th group. Some methods 
of constructing regions $w_1$, $w_2$, $w_3$ are discussed below. 
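
The way the three rejection regions combine into seven decisions can be written out as a small lookup. In the sketch below the regions are supplied as membership tests; the interval-shaped regions in the example are invented for illustration, and the case where all three groups are rejected, which is not among the seven decisions described in the text, is reported separately as an assumption of this sketch.

```python
def decide(rejected):
    """Map the set of rejected groups (a subset of {1, 2, 3}) to a decision."""
    remaining = sorted({1, 2, 3} - set(rejected))
    if len(remaining) == 3:
        return "doubtful: no decision"
    if len(remaining) == 2:
        return "group {} or {}".format(*remaining)
    if len(remaining) == 1:
        return "group {}".format(remaining[0])
    # Not one of the seven decisions in the text; handled here by assumption.
    return "all three groups rejected"


def classify(point, in_w1, in_w2, in_w3):
    """Apply the three rejection regions to one observed point."""
    tests = (in_w1, in_w2, in_w3)
    rejected = {k for k, inside in enumerate(tests, start=1) if inside(point)}
    return decide(rejected)


# Hypothetical rejection regions on a single axis.
def in_w1(y):
    return y > 1.0


def in_w2(y):
    return abs(y) > 2.0


def in_w3(y):
    return y < -1.0


for y in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(y, classify(y, in_w1, in_w2, in_w3))
```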

Regions $w_i$ when $\pi_1$, $\pi_2$, $\pi_3$ are known: If $\pi_1$, $\pi_2$, $\pi_3$ considered in 
theorem 1 are known, then region $w_1$ is such that 

$$\int_{w_1} \pi_1 f_1 \, dv = \alpha \quad \text{(a small assigned quantity)}$$

and 

$$\int_{w_1} (\pi_2 f_2 + \pi_3 f_3) \, dv \quad \text{is a maximum}$$

The boundary surface of such a region is 

$$\pi_1 f_1 \leq a(\pi_2 f_2 + \pi_3 f_3)$$

where the constant $a$ is suitably determined. Similarly, $w_2$ and $w_3$ can 
be constructed. 




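Regions $w_i$ independent of any a priori information: Consider the
region $w'$ such that 

$$\int_{w'} f_1 \, dv = \alpha$$

and 

$$\int_{w'} f_2 \, dv = \int_{w'} f_3 \, dv = \beta \quad \text{is a maximum}$$

Having determined $\beta$, it is possible to construct the region $w_1$ such that 

$$\int_{w_1} f_1 \, dv = \alpha, \qquad \int_{w_1} f_i \, dv = \beta, \qquad \int_{w_1} f_j \, dv \quad \text{is a maximum} \quad (j \neq i)$$

where $i$ is taken as 2 or 3 according to an inequality between the
integrals of $f_2$ and $f_3$. The boundary surface of such a region is of the 
form 

$$f_1 \leq a f_2 + b f_3$$

where $a$ and $b$ are suitably determined. 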


There is an alternative method of constructing $w_1$, $w_2$, $w_3$ which may 
be useful in some practical situations. Let $\beta_2$ and $\beta_3$ be the maximum 
values of 

$$\int_{u_2} f_2 \, dv \qquad \text{and} \qquad \int_{u_3} f_3 \, dv$$

subject to the conditions 

$$\int_{u_2} f_1 \, dv = \alpha, \qquad \int_{u_3} f_1 \, dv = \alpha$$

where $u_2$ and $u_3$ are the regions corresponding to the maxima. Region 
$w_1$ may be determined such that 

$$\int_{w_1} f_1 \, dv = \alpha$$

and 

$$\frac{1}{\beta_2} \int_{w_1} f_2 \, dv = \frac{1}{\beta_3} \int_{w_1} f_3 \, dv \quad \text{is a maximum}$$

The boundary surface of such a region is again of the form 

$$f_1 \leq a f_2 + b f_3$$

where $a$ and $b$ are suitably determined. 


Denoting the populations or possible alternatives by $H_1$, $H_2$, $H_3$, the 
rule of procedure is as indicated in Figure 3. 

FIGURE 3. Division of the space for seven possible decisions. (The diagram 
marks the boundaries of the critical regions for rejecting $H_1$, $H_2$, and $H_3$, 
and labels each resulting region with the hypothesis or pair of hypotheses 
accepted there.) 




8b.9 Application to Multivariate Normal Populations 
For multivariate normal populations the probability density for the 
$r$th group is 

$$f_r = \text{const.} \, \exp\left\{ -\tfrac{1}{2} \sum_i \sum_j \lambda^{ij} (x_i - \mu_{ir})(x_j - \mu_{jr}) \right\}$$

The surfaces of constant likelihood ratios are defined by 

$$\sum_j \left\{ \sum_i \lambda^{ij} (\mu_{ir} - \mu_{is}) \right\} x_j = \text{const.}, \qquad r, s = 1, 2, 3$$

These surfaces can also be defined in terms of what may be called 
linear discriminant scores defined in terms of the constants for the $r$th 
group only. 

$$L_r = \sum_j \left( \sum_i \lambda^{ij} \mu_{ir} \right) x_j - \tfrac{1}{2} \sum_i \sum_j \lambda^{ij} \mu_{ir} \mu_{jr}, \qquad r = 1, 2, 3$$

A constant likelihood ratio corresponds to a constant difference in the 
discriminant scores. If the a priori probabilities are $\pi_1$, $\pi_2$, $\pi_3$ for the 
three groups, then the rule of procedure is to assign an observed
individual to that group for which 

$$L_r + \log_e \pi_r$$

is a maximum. 


Example 1. The scores in three tests, A, B, and C, of 256 army re-
cruits classified by their neurotic condition have the mean values as
given in Table 8b.9α.

TABLE 8b.9α. Mean Scores of Neurotic Groups (Rao and Slater, 1949)

                       Sample            Mean Score
Group                   Size        A         B         C
Anxiety state            114     2.9298    1.1667    0.7281
Hysteria                  33     3.0303    1.2424    0.5455
Psychopathy               32     3.8125    1.8438    0.8125
Obsession                 17     4.7059    1.5882    1.1176
Personality change         5     1.4000    0.2000    0.0000
Normal                    55     0.6000    0.1455    0.2182


The dispersion matrix within the groups and its reciprocal are given
below.

      Within Dispersion Matrix ($\lambda_{ij}$)          Reciprocal ($\lambda^{ij}$)

          A          B          C              A           B           C
A     2.300851   0.251578   0.474169       0.543234   -0.200195   -0.420813
B     0.251578   0.607466   0.035774      -0.200195    1.725807    0.055767
C     0.474169   0.035774   0.595094      -0.420813    0.055767    2.012357


For any group the linear discriminant score is

$$l_1 A + l_2 B + l_3 C - \tfrac{1}{2}(l_1 m_1 + l_2 m_2 + l_3 m_3)$$

where

$$l_1 = \lambda^{11}m_1 + \lambda^{12}m_2 + \lambda^{13}m_3$$
$$l_2 = \lambda^{21}m_1 + \lambda^{22}m_2 + \lambda^{23}m_3$$
$$l_3 = \lambda^{31}m_1 + \lambda^{32}m_2 + \lambda^{33}m_3$$

$m_1$, $m_2$, $m_3$ are the mean values of A, B, C, and the elements $\lambda^{ij}$ belong
to the reciprocal of the dispersion matrix. For the anxiety state group

m_1 = 2.9298    m_2 = 1.1667    m_3 = 0.7281

l_1 =  0.5432(2.9298) - 0.2002(1.1667) - 0.4208(0.7281) = 1.0515
l_2 = -0.2002(2.9298) + 1.7258(1.1667) + 0.0558(0.7281) = 1.4676
l_3 = -0.4208(2.9298) + 0.0558(1.1667) + 2.0124(0.7281) = 0.2975

$\tfrac{1}{2}(l_1 m_1 + l_2 m_2 + l_3 m_3) = \tfrac{1}{2}\{1.0515(2.9298) + \cdots\} = 2.5047$

Hence the discriminant score for the anxiety state is

L = 1.0515A + 1.4676B + 0.2975C - 2.5047

For purposes of classification the expression to be calculated is $L + \log_e \pi$,
where $\pi$ denotes the relative frequency of cases of anxiety state. The
discriminant scores involving the relative frequencies are given in
Table 8b.9β.
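
The arithmetic for the anxiety state group can be retraced with a short sketch. It is only an illustration: the function name `discriminant_score` and the variable names are assumptions made here, and the reciprocal matrix is copied from the table above at six decimals, so the last digit may differ slightly from the four-decimal working in the text.

```python
# Reciprocal of the within-groups dispersion matrix, in the order A, B, C.
RECIPROCAL = [
    [0.543234, -0.200195, -0.420813],
    [-0.200195, 1.725807, 0.055767],
    [-0.420813, 0.055767, 2.012357],
]


def discriminant_score(means, reciprocal=RECIPROCAL):
    """Return (l, constant) so that the score is L = sum(l_i * x_i) - constant.

    l_i = sum_j lambda^{ij} m_j and constant = 0.5 * sum_i l_i m_i.
    """
    coeffs = [sum(row[j] * means[j] for j in range(len(means))) for row in reciprocal]
    constant = 0.5 * sum(l * m for l, m in zip(coeffs, means))
    return coeffs, constant


# Mean scores of the anxiety state and normal groups (Table 8b.9 alpha).
anxiety_l, anxiety_const = discriminant_score([2.9298, 1.1667, 0.7281])
normal_l, normal_const = discriminant_score([0.6000, 0.1455, 0.2182])
```

The same call with the mean scores of any other group gives the corresponding row of coefficients.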

The present data are not a representative sample of officers serving 
in the Army and the Navy. The sample of neurotic officers has not been 
exposed to any known bias, and the proportions between the numbers 
in the various groups may be fairly representative; but the number of 
normal officers is grossly under-represented. It seems impossible to 
obtain a reliable general estimate of the risk that a man will be referred 
to a hospital for the treatment of a neurosis while he is serving in the
Army or the Navy as an officer; but the indications are that, even 
under conditions of very severe stress, it is not more than 2 to 3 per 
cent. For proportional representation over 100 times as many normal 
cases should have been reported. 

In Table 8b.9β the formulae have been given in terms of general
relative frequencies and also for the particular values realized in the
sample although they are subject to systematic as well as to chance
errors. The formulae are unsuitable for practical use unless reliable
estimates of the relative frequencies are available.




TABLE 8b.9β. The Linear Discriminant Scores for Various Groups

                         Coefficients of                     Constant Term
                          Measurements           (a) In Terms of           (b) For Propor-
                                                 General Proportion        tions in the
Group                   A        B        C                                Present Data*
Normal                0.2050   0.1431   0.1947   -0.0931 + log_e π_1         -1.6311
Personality change    0.7204   0.0649  -0.5780   -0.5107 + log_e π_2         -4.4465
Anxiety state         1.0515   1.4676   0.2974   -2.5047 + log_e π_3         -3.3137
Hysteria              1.1678   1.5679  -0.1081   -2.7139 + log_e π_4         -4.7626
Psychopathy           1.3599   2.4641   0.1336   -4.9182 + log_e π_5         -6.9977
Obsession             1.7680   1.8611   0.8573   -5.8375 + log_e π_6         -8.5495

* π_1 = 0.21484, π_2 = 0.01953, π_3 = 0.44531, π_4 = 0.12891, π_5 = 0.12500, π_6 = 0.06641.
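
As a hedged illustration of the rule of assigning an individual to the group with the highest value of $L_r + \log_e \pi_r$, the sketch below applies the coefficients of the table above. The function name `classify`, the dictionary layout, and the trial individual are assumptions introduced here; the relative frequencies are the sample proportions out of 256, which the text itself warns are not representative.

```python
import math

# (coefficients of A, B, C), constant term; values copied from the table above.
SCORES = {
    "Normal": ((0.2050, 0.1431, 0.1947), -0.0931),
    "Personality change": ((0.7204, 0.0649, -0.5780), -0.5107),
    "Anxiety state": ((1.0515, 1.4676, 0.2974), -2.5047),
    "Hysteria": ((1.1678, 1.5679, -0.1081), -2.7139),
    "Psychopathy": ((1.3599, 2.4641, 0.1336), -4.9182),
    "Obsession": ((1.7680, 1.8611, 0.8573), -5.8375),
}

# Sample sizes from which the relative frequencies of the footnote are formed.
SAMPLE_SIZES = {
    "Normal": 55, "Personality change": 5, "Anxiety state": 114,
    "Hysteria": 33, "Psychopathy": 32, "Obsession": 17,
}
SAMPLE_PRIORS = {g: n / 256 for g, n in SAMPLE_SIZES.items()}


def classify(a, b, c, priors=None):
    """Return the group with the highest score; add log(prior) when priors are given."""
    best_group, best_score = None, -math.inf
    for group, (coef, const) in SCORES.items():
        score = coef[0] * a + coef[1] * b + coef[2] * c + const
        if priors is not None:
            score += math.log(priors[group])
        if score > best_score:
            best_group, best_score = group, score
    return best_group


# Hypothetical individual scoring exactly the obsession group means.
without_priors = classify(4.7059, 1.5882, 1.1176)
with_priors = classify(4.7059, 1.5882, 1.1176, SAMPLE_PRIORS)
```

For this trial individual the two calls disagree, which shows how strongly the assumed relative frequencies can influence the assignment.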

Given the measurements A, B, C of an individual, we calculate the
linear discriminant scores $L_1, \ldots, L_6$, corrected for a priori probabilities,
and assign him to the group for which his score is highest. If the a
priori probabilities are not known, the maximum likelihood method
leads to the rule of assigning an individual to that group for which $L$ is
highest.

Example 2. Table 8b.9γ gives the statistical constants for three
groups of individuals measured by D. N. Majumdar. These groups are
considered again in Chapter 9 where the constants for all available
characters are given.


TABLE 8b.9γ. Statistical Constants for Three Groups

Mean Values

                        Sitting    Nasal     Nasal
             Stature    Height     Depth     Height
Group         (St)       (SH)      (ND)      (NH)
Brahmin      164.51     86.43     25.49     51.24
Artisan      160.53     81.47     23.84     48.62
Korwa        158.17     81.16     21.44     46.72

Dispersion Matrix

              St         SH        ND        NH
St           32.95      7.43      1.78      3.97
SH                     10.24      1.17      2.43
ND                                3.06      1.78
NH                                         12.25




By inverting the dispersion matrix we can obtain the linear dis- 
criminant scores, as in the illustration of neurotic groups, and use them 
for classification. In order to determine the probabilities of wrong 
classification for each group it is necessary to go through a slightly com- 
plicated procedure. If there are only three groups, the four (in gen- 
eral, p) measurements can be replaced by two independent linear func- 
tions, given which the relative distributions in all the groups become 
identical. The problem is thus reduced to a two-variable case. The
two independent functions can be obtained in a number of ways. One
simple method is to calculate the discriminant functions for any two
pairs of groups. The computational method is to write down the dis-
persion matrix with two appended columns:

Differences in Mean Values

                      (St)     (SH)     (ND)     (NH)
Brahmin - Artisan     3.98     4.96     1.65     2.62
Artisan - Korwa       2.36     0.31     2.40     1.90

and reduce it, for solving equations (see 1d.1).

32.95     7.43     1.78     3.97     3.98     2.36
         10.24     1.17     2.43     4.96     0.31
                   3.06     1.78     1.65     2.40
                           12.25     2.62     1.90


The two sets of solutions give the two discriminant functions for the
pairs Brahmin, Artisan and Artisan, Korwa,

X = -0.0039St + 0.4301SH + 0.3293ND + 0.0819NH
Y =  0.0476St - 0.1036SH + 0.7679ND + 0.0486NH

and the discriminant function for Brahmin, Korwa is X + Y, which
have the following mean values.

               X          Y        X + Y
Brahmin     49.1224    20.9406    70.0630
Artisan     46.2467    19.8706    66.1173
Korwa       45.1766    17.8551    63.0317
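
The coefficients of X and Y are the solutions of the linear equations whose matrix is the dispersion matrix and whose right-hand sides are the two columns of mean differences. The sketch below retraces that step with a plain Gauss-Jordan elimination; the helper `solve` and the variable names are assumptions introduced for illustration, and the tabulated data are rounded, so agreement with the printed coefficients is only to about three decimals.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Symmetric dispersion matrix in the order St, SH, ND, NH.
DISPERSION = [
    [32.95, 7.43, 1.78, 3.97],
    [7.43, 10.24, 1.17, 2.43],
    [1.78, 1.17, 3.06, 1.78],
    [3.97, 2.43, 1.78, 12.25],
]

x_coef = solve(DISPERSION, [3.98, 4.96, 1.65, 2.62])  # Brahmin - Artisan
y_coef = solve(DISPERSION, [2.36, 0.31, 2.40, 1.90])  # Artisan - Korwa
```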


The discriminant function for Artisan - Korwa gives the rule for dis-
tinguishing Artisan from Korwa when

$$Y > \frac{19.8706 + 17.8551}{2} = 18.8628$$

Similarly, Artisan is distinguished from Brahmin when

$$X < \frac{49.1224 + 46.2467}{2} = 47.6845$$

Therefore, the maximum likelihood method of classification for Artisan
is Y > 18.8628, X < 47.6845. Similarly, by considering the discrim-
inant functions between Brahmin, Artisan and Brahmin, Korwa, the
rule for Brahmin is

X > 47.6845     X + Y > 66.5473

For Korwa the rule is

Y < 18.8628     X + Y < 66.5473

obtained by considering the two discriminant functions separating
Korwa from Artisan and Korwa from Brahmin. For instance consider
an individual with

St = 162.00     SH = 84.00     ND = 24.00     NH = 49.00

The value of X = 47.4129, Y = 19.8198, and X + Y = 67.2327. Since

19.8198 > 18.8628 and 47.4129 < 47.6845

the individual is assigned to the Artisan group.
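
The three rules can be collected into one small routine. This is a sketch under stated assumptions: the names `xy_scores` and `classify` are invented here, the cut-off points are recomputed as midpoints of the group means of X, Y, and X + Y using the rounded coefficients printed above, and ties on a boundary are not treated specially.

```python
X_COEF = (-0.0039, 0.4301, 0.3293, 0.0819)
Y_COEF = (0.0476, -0.1036, 0.7679, 0.0486)

# Group means of St, SH, ND, NH.
MEANS = {
    "Brahmin": (164.51, 86.43, 25.49, 51.24),
    "Artisan": (160.53, 81.47, 23.84, 48.62),
    "Korwa": (158.17, 81.16, 21.44, 46.72),
}


def xy_scores(measurements):
    """Return the pair (X, Y) of discriminant function values."""
    x = sum(c * m for c, m in zip(X_COEF, measurements))
    y = sum(c * m for c, m in zip(Y_COEF, measurements))
    return x, y


(BX, BY), (AX, AY), (KX, KY) = (xy_scores(MEANS[g]) for g in ("Brahmin", "Artisan", "Korwa"))
X_CUT = (BX + AX) / 2                 # Brahmin versus Artisan
Y_CUT = (AY + KY) / 2                 # Artisan versus Korwa
SUM_CUT = (BX + BY + KX + KY) / 2     # Brahmin versus Korwa


def classify(measurements):
    x, y = xy_scores(measurements)
    if x > X_CUT and x + y > SUM_CUT:
        return "Brahmin"
    if y < Y_CUT and x + y < SUM_CUT:
        return "Korwa"
    return "Artisan"


individual = (162.00, 84.00, 24.00, 49.00)
individual_xy = xy_scores(individual)
individual_group = classify(individual)
```

Because the Brahmin versus Korwa cut-off equals the sum of the other two, the three regions cover the whole chart and the final branch corresponds to Y > 18.8628, X < 47.6845.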
Figure 4 gives the two-dimensional chart for X and Y with respect
to which the individuals can be classified. The point (X, Y) for the
observed individual is represented by I and the mean values by $B_M$,
$A_M$, and $K_M$.

[Figure 4: chart with X on the horizontal axis (45 to 51) and Y on the vertical axis (17 to 21), showing the Brahmin, Artisan, and Korwa regions, the mean points $B_M$ (49.12, 20.94), $A_M$ (46.25, 19.87), $K_M$ (45.18, 17.85), and the individual I.]

FIGURE 4. The regions separating the three groups.

To determine the errors of classification we need to find the variances
and covariances of X and Y which can be simply obtained from the mean
values.

V(X) = B_X - A_X = 49.1224 - 46.2467 = 2.8757
V(Y) = A_Y - K_Y = 19.8706 - 17.8551 = 2.0155
cov (X, Y) = A_X - K_X = 46.2467 - 45.1767
           = B_Y - A_Y = 20.9406 - 19.8706 = 1.0700
V(X + Y) = V(X) + V(Y) + 2 cov (X, Y)
         = 2.8757 + 2.0155 + 2(1.0700) = 7.0312
cov (X, X + Y) = V(X) + cov (X, Y)
               = 2.8757 + 1.0700 = 3.9457
cov (Y, X + Y) = V(Y) + cov (X, Y)
               = 2.0155 + 1.0700 = 3.0855

The correlation matrix of X, Y and X + Y is

                         X         Y       X + Y
X                     1.0000    0.4459    0.8810
Y                               1.0000    0.8200
X + Y                                     1.0000
Standard deviation      1.69      1.42      2.65
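
The variance of a discriminant function equals the difference between the two group means of that function, since with coefficients $b = \Lambda^{-1}d$ we have $b'\Lambda b = b'd$. The following sketch reproduces the quantities above from the table of mean values; the variable names are assumptions, and because the standard deviations are not rounded to two decimals here, the correlations agree with the printed ones only to about two decimals.

```python
import math

MEAN_X = {"Brahmin": 49.1224, "Artisan": 46.2467, "Korwa": 45.1766}
MEAN_Y = {"Brahmin": 20.9406, "Artisan": 19.8706, "Korwa": 17.8551}

var_x = MEAN_X["Brahmin"] - MEAN_X["Artisan"]    # X separates Brahmin from Artisan
var_y = MEAN_Y["Artisan"] - MEAN_Y["Korwa"]      # Y separates Artisan from Korwa
cov_xy = MEAN_Y["Brahmin"] - MEAN_Y["Artisan"]   # shift in Y between the groups X separates

var_sum = var_x + var_y + 2 * cov_xy
cov_x_sum = var_x + cov_xy
cov_y_sum = var_y + cov_xy

corr_xy = cov_xy / math.sqrt(var_x * var_y)
corr_x_sum = cov_x_sum / math.sqrt(var_x * var_sum)
corr_y_sum = cov_y_sum / math.sqrt(var_y * var_sum)
```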


The proportion of right classification for Brahmin is

$$P(X > 47.6845,\ X + Y \ge 66.5473)$$

which give the deviates

$$h = \frac{47.6845 - 49.1224}{1.69} = -0.85 \quad \text{and} \quad k = \frac{66.5473 - 70.0630}{2.65} = -1.33$$

The probability for h > 0.85, k > 1.33 and r = 0.88 is given in Part II
of Pearson's Tables for Statisticians and Biometricians. This value is
approximately 0.085. The required probability for wrong classification is

P(h > 0.85) + P(k > 1.33) - P(h > 0.85, k > 1.33)

= 0.195 + 0.092 - 0.085 = 0.202




where the first two probabilities are obtained from univariate normal 
tables. 
Similarly, for Korwa the deviates are 


h = 0.71 and k = 1.33, r = 0.82 
and the probability for correct classification is 
P(h < 0.71, k < 1.33)


The tabular value gives the probability 0.085 for h > 0.71 and k > 1.33. 
The probability for wrong classification is 


P(h > 0.71) + P(k > 1.33) - P(h > 0.71, k > 1.33)
= 0.239 + 0.092 — 0.085 = 0.246 
Similarly, for Artisan the deviates are 
h = 0.85 and k = —0.71, r = 0.45 
and the probability for correct classification is 
P(h < 0.85, k > —0.71) 


Since one of the deviates is negative, the tabular entry for h > 0.85 and 
k > 0.71 has to be obtained, taking r to be negative. For r = —0.44 
the value is 0.013. The probability for wrong classification is 


P(h > 0.85) + P(k > 0.71) — P(h > 0.85, k > 0.71) 


= 0.195 + 0.239 — 0.013 = 0.421 
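
The three error computations share one pattern: form the two standardized deviates, then combine the two single probabilities and the tabulated joint probability. A minimal sketch follows; the function names are assumptions, and the probabilities entered are the rounded values quoted in the text (the joint values come from Pearson's tables and are not recomputed here).

```python
def deviate(cut, mean, sd):
    """Standardized distance of a cut-off point from a group mean."""
    return (cut - mean) / sd


def wrong_classification(p_h, p_k, p_joint):
    """P(h exceeded) + P(k exceeded) - P(both exceeded)."""
    return p_h + p_k - p_joint


# Deviates for Brahmin, from the cut-off points, means, and standard deviations.
h_brahmin = deviate(47.6845, 49.1224, 1.69)
k_brahmin = deviate(66.5473, 70.0630, 2.65)

errors = {
    "Brahmin": wrong_classification(0.195, 0.092, 0.085),
    "Korwa": wrong_classification(0.239, 0.092, 0.085),
    "Artisan": wrong_classification(0.195, 0.239, 0.013),
}
```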


8b.10 Allocation of a Number of Individuals to Two or More Groups 

Suppose that $n_1$ and $n_2$ posts have to be filled in the Navy and the
Air Force, a candidate being chosen on the basis of his performance in a
test. Assuming that the distribution of test scores for those who are
fit for the Navy and the Air Force are available from past experience,
how can this knowledge be used for the most efficient selection?

In the actual population the relative proportions of candidates suit-
able for the Navy may be different from $n_1 : n_2$. The procedure for
selection must be such that, whatever may be the actual proportion in
the population, the division of a sample of $(n_1 + n_2)$ individuals in the
assigned ratio $n_1 : n_2$ should involve the least possible errors. A similar
problem is the allocation of a given number of skulls into two sexes in a
given ratio which is determined from some a priori considerations. This
may be only an estimated proportion and hence may not represent the
true sex ratio. Whatever criterion is chosen, some male skulls will be




classified as female and vice versa. A procedure which gives the least 
value to the expected number of wrong classifications in either group 
may be regarded as the best one. 

There may arise another situation. Two samples of sizes $n_1$ and $n_2$
drawn independently from the first and second groups may get mixed.
The difference between the first and second problems is that in the latter
every sample is known to consist of $n_1$ individuals from the first group
and $n_2$ from the second whereas in the former no such information is
available, the sample being drawn at random from a mixed population.

Solution to the First Problem. Let $x_1, \ldots, x_n$ represent the measure-
ments on n individuals. As observed earlier, $x_i$ will stand for all the
available set of measurements on the ith individual. The probability
of the set is

$$\prod_{i=1}^{n}\,[\pi_1 f_1(x_i) + \pi_2 f_2(x_i)]$$

Consider the following set of functions

$$\delta_i(x_1, \ldots, x_n), \quad \delta_i'(x_1, \ldots, x_n) \qquad i = 1, \ldots, n$$

which can be represented simply as $\delta_i$, $\delta_i'$, satisfying the conditions

$$\delta_i = 0 \text{ or } 1 \qquad \delta_i + \delta_i' = 1$$

and

$$\sum \delta_i = n_1 \qquad \sum \delta_i' = n_2$$

where $n_1$ and $n_2$ are the specified numbers. If the individual with
measurements $x_i$ is assigned to the first group when $\delta_i = 1$ and to the
second when $\delta_i' = 1$, then the above set of functions defines a deci-
sion rule. The problem is then to construct the functions such
that the expected risk associated with this decision rule is a minimum.

To calculate the expected risk we need to know as a datum the
loss of assigning an individual of one group to another. If $r_{12}$ repre-
sents the loss in assigning an individual of the first group to the second
and $r_{21}$ the loss in the other case, then the quantity to be minimized is
the expected value of

$$\sum_i (\delta_i a_{i1} + \delta_i' a_{i2}) \qquad (8b.10.1)$$

where

$$a_{i1} = \frac{r_{21}\pi_2 f_2(x_i)}{\pi_1 f_1(x_i) + \pi_2 f_2(x_i)} \qquad a_{i2} = \frac{r_{12}\pi_1 f_1(x_i)}{\pi_1 f_1(x_i) + \pi_2 f_2(x_i)}$$

The expected loss will be a minimum if $\delta_i$ and $\delta_i'$ are chosen such that
the expression (8b.10.1) has the least value. The problem is the same
as that treated in lemma 2 of Appendix A5 leading to the solution

$$\delta_i = 1 \ \text{ if } \ a_{i1} + \mu_1 \le a_{i2} + \mu_2$$
$$\delta_i' = 1 \ \text{ if } \ a_{i1} + \mu_1 > a_{i2} + \mu_2$$

where $\mu_1$ and $\mu_2$ are suitably chosen to satisfy the condition $\sum \delta_i = n_1$.
Now $(a_{i1} - a_{i2}) \le (\mu_2 - \mu_1)$ implies that

$$\frac{r_{21}\pi_2 f_2(x_i) - r_{12}\pi_1 f_1(x_i)}{\pi_1 f_1(x_i) + \pi_2 f_2(x_i)} \le \mu_2 - \mu_1$$

or

$$\frac{f_1(x_i)}{f_2(x_i)} \ge \lambda$$

so that the decision rule reduces to the evaluation of the likelihood
ratios $\lambda_i = f_1(x_i)/f_2(x_i)$ and assigning all the individuals with highest
$n_1$ values of the ratios to the first group and the rest to the second.
Fortunately the decision rule is independent of the a priori probabilities
and also the loss function.
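
The rule of taking the $n_1$ largest likelihood ratios is easy to sketch. The example below is an illustration only: the function names, the use of normal densities with unit variance, and the trial scores are all assumptions introduced here, and logarithms of the ratios are ranked since the ordering is unchanged.

```python
import math


def normal_logpdf(x, mean, sd):
    """Logarithm of the normal density with the given mean and standard deviation."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)


def allocate(observations, n1, logpdf1, logpdf2):
    """Assign the n1 observations with the highest f1/f2 to group 1, the rest to group 2."""
    log_ratio = [logpdf1(x) - logpdf2(x) for x in observations]
    order = sorted(range(len(observations)), key=lambda i: log_ratio[i], reverse=True)
    first = set(order[:n1])
    return [1 if i in first else 2 for i in range(len(observations))]


# Hypothetical test scores; group 1 is taken as N(3, 1) and group 2 as N(0, 1).
scores = [0.2, 2.9, 1.4, 3.5, -0.5]
groups = allocate(
    scores, 2,
    lambda x: normal_logpdf(x, 3.0, 1.0),
    lambda x: normal_logpdf(x, 0.0, 1.0),
)
```

Neither the a priori proportions nor the losses enter the routine, in line with the remark that the rule is independent of both.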

Corresponding to every decision rule we can set up a density func- 
tion of the observations by considering all individuals assigned to the 
first group as having been drawn at random from the first group and 
similarly for the second. By using lemma 1 in Appendix A5 we find that 
the best decision rule found above maximizes the corresponding prob- 
ability density. As shown below this forms the basis on which the solu- 
tion of the second problem depends. 

Solution to the Second Problem. In this problem the mixture is known
to consist of $n_1$ individuals drawn from the first group and $n_2$ from the
second. The observations $x_1, \ldots, x_n$ could have arisen in $\binom{n}{n_1}$ ways,
any subset of $n_1$ observations belonging to the first group. The prob-
ability density of the observations is equal to the sum of the densities
associated with $\binom{n}{n_1}$ ways of splitting the sample. If $x_a, x_b, \ldots$ and
$x_p, x_q, \ldots$ represent a division into two groups of sizes $n_1$ and $n_2$, then
the probability density of the observations can be written as

$$P(x_1, \ldots, x_n) \propto {\sum}' f_1(x_a) f_1(x_b) \cdots f_2(x_p) f_2(x_q) \cdots \qquad (8b.10.2)$$

where the summation is over $\binom{n}{n_1}$ such terms. Corresponding to any
one of $\binom{n}{n_1}$ possible decision rules the loss relative to the given set of
observations is $1/P(x_1, \ldots, x_n)$ times

$${\sum}'\, l(a, b, \ldots;\ p, q, \ldots)\, f_1(x_a) f_1(x_b) \cdots f_2(x_p) f_2(x_q) \cdots \qquad (8b.10.3)$$

where $l(a, b, \ldots;\ p, q, \ldots)$ is the loss incurred in adopting a given de-
cision rule when in fact $x_a, x_b, \ldots$ come from the first group and $x_p,
x_q, \ldots$ from the second. The loss will generally be a function of the
number of wrong classifications only. That decision rule for which the
expression (8b.10.3) is the least is the best in the sense that it minimizes
the expected loss. The solution depends on the evaluation of the ex-
pression (8b.10.3) for each of the $\binom{n}{n_1}$ decision rules, and this makes
the application a little difficult in practice even when n is small.
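
For a very small sample the evaluation of (8b.10.3) over every division can be carried out directly. The sketch below assumes a loss equal to the number of wrong classifications; the function name `evaluate_splits` and the trial densities are hypothetical, and the routine is meant only to show the bookkeeping, not to be practical for large n.

```python
from itertools import combinations
from math import prod


def evaluate_splits(f1, f2, n1):
    """f1[i], f2[i]: densities of observation i under groups 1 and 2.

    Returns the division minimizing the expected number of wrong
    classifications, the division of maximum density, and the minimum loss.
    """
    n = len(f1)
    splits = [frozenset(c) for c in combinations(range(n), n1)]
    # Density of the sample if exactly the members of `first` came from group 1.
    density = {
        first: prod(f1[i] if i in first else f2[i] for i in range(n))
        for first in splits
    }
    total = sum(density.values())

    def expected_loss(rule):
        # len(rule - truth) individuals are wrongly placed in group 1 and the same
        # number wrongly placed in group 2, since both sets have n1 members.
        return sum(d * 2 * len(rule - truth) for truth, d in density.items()) / total

    min_loss_rule = min(splits, key=expected_loss)
    max_density_rule = max(splits, key=density.get)
    return min_loss_rule, max_density_rule, expected_loss(min_loss_rule)


# Hypothetical density values for four observations, two from each group.
best_rule, ml_rule, best_loss = evaluate_splits(
    [0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8], 2
)
```

The number of divisions grows as a binomial coefficient, which is the practical difficulty the text refers to.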

As an alternative we may try to minimize the maximum loss incurred
by following a decision rule. If we suppose that the loss function is
proportional to the number of wrong classifications, then the maximum
loss occurs when all the individuals in the smaller group are wrongly
classified. With this assumption it can be shown that the division of
the sample corresponding to the maximum probability density supplies
the best possible solution to the problem. This solution may be re-
ferred to as the maximum likelihood solution; we consider the $\binom{n}{n_1}$
ways of splitting the sample as associated with $\binom{n}{n_1}$ different hypotheses
concerning the individuals in the sample and choose that hypothesis
which has the maximum likelihood.
To prove the property referred to above, consider any other decision
rule leading to a division

$$x_{a_1}, x_{a_2}, \ldots, x_{b_1}, x_{b_2}, \ldots \quad \text{and} \quad x_{c_1}, x_{c_2}, \ldots, x_{d_1}, x_{d_2}, \ldots$$

of the sample into two groups of sizes $n_1$ and $n_2$ and compare with the
division

$$x_{a_1}, x_{a_2}, \ldots, x_{c_1}, x_{c_2}, \ldots \quad \text{and} \quad x_{b_1}, x_{b_2}, \ldots, x_{d_1}, x_{d_2}, \ldots$$

associated with the maximum density. The measurements classified in
the same way by the two decision rules are represented by $x_{a_1}, x_{a_2}, \ldots$ for
the first group and by $x_{d_1}, x_{d_2}, \ldots$ for the second group. By the maximum
property of the second division

$$\frac{f_1(x_{c_i})}{f_2(x_{c_i})} \ge \frac{f_1(x_{b_j})}{f_2(x_{b_j})} \qquad (8b.10.4)$$

and the same is true for the product of a number of ratios involving $x_c$
and the product of the same number of ratios involving $x_b$.

Let $n_2 < n_1$ without loss of generality. In this case the maximum
loss occurs, by following the first decision rule when

$$x_{c_1}, x_{c_2}, \ldots, x_{d_1}, x_{d_2}, \ldots \qquad (8b.10.5)$$

arise from the first group in which case a subset $n_2$ out of

$$x_{a_1}, x_{a_2}, \ldots, x_{b_1}, x_{b_2}, \ldots \qquad (8b.10.6)$$

arise from the second group. Let this subset be

$$x_{a_i}, x_{a_j}, \ldots, x_{b_p}, x_{b_q}, \ldots$$

By replacing the subscript c by b we obtain the corresponding situation
for the proposed maximum likelihood decision rule. The difference
between the above two probability densities associated with maximum
errors for the two rules is, apart from a common multiplier, equal to

$$[f_1(x_{c_p}) f_1(x_{c_q}) \cdots f_2(x_{b_p}) f_2(x_{b_q}) \cdots - f_1(x_{b_p}) f_1(x_{b_q}) \cdots f_2(x_{c_p}) f_2(x_{c_q}) \cdots]$$

which is not less than zero according to (8b.10.4). By considering all
subsets of $n_2$ observations out of (8b.10.6) we exhaust all possible ways
in which the observations leading to maximum error according to the
first rule can arise. To each such case there is a corresponding division
leading to the maximum error for the proposed rule. But this division
leads to a smaller probability density. The total chance of maximum
error relative to the given set of observations is thus a minimum for the
maximum likelihood decision rule.

The Problem of Three Groups. As in the case of two groups we con-
sider two situations, first when the sample consists of n individuals
observed at random from a mixed population and second when the
sample is a mixture of $n_1$ individuals drawn from the first group, $n_2$
from the second, and $n_3$ from the third. The problem in either case is
to select $n_1$ individuals for the first group, $n_2$ for the second, and $n_3$ for
the third where $n_1 + n_2 + n_3 = n$.

Let $f_1(x)$, $f_2(x)$, $f_3(x)$ represent the probability densities of x for the
three groups and $\pi_1$, $\pi_2$, $\pi_3$ the proportions of mixture in the general
population. The loss in assigning a person to the ith group when, in
fact, he belongs to the jth group is denoted by $r_{ji}$. The a posteriori
risks in assigning an individual with measurements $x_i$ to the first, sec-
ond, and third groups are, respectively, equal to

$$a_{1i} = \frac{\pi_2 f_2(x_i) r_{21} + \pi_3 f_3(x_i) r_{31}}{\psi(x_i)}$$

$$a_{2i} = \frac{\pi_1 f_1(x_i) r_{12} + \pi_3 f_3(x_i) r_{32}}{\psi(x_i)}$$

$$a_{3i} = \frac{\pi_2 f_2(x_i) r_{23} + \pi_1 f_1(x_i) r_{13}}{\psi(x_i)}$$

where

$$\psi(x_i) = \pi_1 f_1(x_i) + \pi_2 f_2(x_i) + \pi_3 f_3(x_i)$$
Consider a set of functions

$$\delta_i = 0 \text{ or } 1 \qquad \delta_i' = 0 \text{ or } 1 \qquad \delta_i'' = 0 \text{ or } 1 \qquad \delta_i + \delta_i' + \delta_i'' = 1$$

such that

$$\sum \delta_i = n_1 \qquad \sum \delta_i' = n_2 \qquad \sum \delta_i'' = n_3$$

They define a decision rule if the individual with measurements $x_i$ is
assigned to the first group when $\delta_i = 1$, to the second when $\delta_i' = 1$, and
to the third when $\delta_i'' = 1$. The a posteriori risk for such a selection
procedure is

$$\sum_i (\delta_i a_{1i} + \delta_i' a_{2i} + \delta_i'' a_{3i})$$

The best decision rule is one that minimizes the above expression. This
is exactly the problem solved in lemma 2 of Appendix A5. The best
solution is

$$\delta_i = 1 \ \text{ when } \ a_{1i} + \lambda_1 \le a_{2i} + \lambda_2, \quad a_{1i} + \lambda_1 \le a_{3i} + \lambda_3$$
$$\delta_i' = 1 \ \text{ when } \ a_{2i} + \lambda_2 \le a_{1i} + \lambda_1, \quad a_{2i} + \lambda_2 \le a_{3i} + \lambda_3$$
$$\delta_i'' = 1 \ \text{ when } \ a_{3i} + \lambda_3 \le a_{1i} + \lambda_1, \quad a_{3i} + \lambda_3 \le a_{2i} + \lambda_2$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ are determined such that $\sum \delta_i = n_1$, $\sum \delta_i' = n_2$, and
$\sum \delta_i'' = n_3$. As it stands, the problem of determination of $\lambda_1$, $\lambda_2$, $\lambda_3$
appears to be complicated. There is a geometrical device which is
helpful in the solution of the problem. In higher dimensional cases
involving four or more groups the geometrical method cannot be ap-
plied. For three groups we replace $a_{1i}$, $a_{2i}$, $a_{3i}$ by two coordinates

$$X_i = a_{1i} - a_{2i} \qquad Y_i = a_{1i} - a_{3i}$$

and represent the n points $(X_i, Y_i)$ on a two-dimensional chart with
rectangular axes. The problem is to determine a point $(X_0, Y_0)$ on
this chart such that the regions formed by the lines $X = X_0$, $Y = Y_0$
and $Y - Y_0 = X - X_0$ contain the requisite number of points. This
can be done by moving three thin rods OX', OY', OZ' fixed at the point
O as shown in Figure 5, with OX' and OY' parallel to the X and Y axes,
by trial and error. It will help in the search if the numbers of points in the
three regions are recorded for the positions of O marked on the chart.

[Figure 5: chart showing the three rods meeting at O and the three regions they form, labeled First group ($n_1$), Second group ($n_2$), and Third group ($n_3$).]

FIGURE 5. The arrangement of three thin rods leading to the required division.

When the sample is a mixture of $n_1$ individuals drawn from the first
group, $n_2$ from the second, and $n_3$ from the third, we have to evaluate
the a posteriori risk relative to the given set of observations for the
$n!/(n_1!\,n_2!\,n_3!)$ possible decision rules and choose that
rule for which the risk is a minimum. This is very difficult in practice
so that a simplified procedure is needed. As before we may choose that
decision rule which leads to a division of the sample with the maximum
probability density. This rule possesses an important property that
the probability of the maximum number of wrong classifications for
any one group is as small as possible.
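
The recording of the numbers of points in the three regions for trial positions of O can be mimicked in a few lines. The sketch is built on assumptions: $X_0$ and $Y_0$ are taken to be $\lambda_2 - \lambda_1$ and $\lambda_3 - \lambda_1$ (so that the first region is $X \le X_0$, $Y \le Y_0$, as the inequalities for $\delta_i$ imply), the function names and trial points are invented, and the search simply tries candidate positions supplied by the user.

```python
from collections import Counter


def region(x, y, x0, y0):
    """Return 1, 2, or 3: the group whose region contains the point (x, y)."""
    dx, dy = x - x0, y - y0
    if dx <= 0 and dy <= 0:
        return 1                      # first group: X <= X0 and Y <= Y0
    return 2 if dy <= dx else 3       # second group lies on or below the line Y - Y0 = X - X0


def region_counts(points, x0, y0):
    counts = Counter(region(x, y, x0, y0) for x, y in points)
    return counts[1], counts[2], counts[3]


def find_origin(points, wanted, candidates):
    """Return the first candidate (x0, y0) giving the wanted counts, else None."""
    for x0, y0 in candidates:
        if region_counts(points, x0, y0) == tuple(wanted):
            return x0, y0
    return None


# Hypothetical points (X_i, Y_i) and two trial positions of O.
points = [(-1, -1), (-2, -0.5), (2, 0), (0, 2), (3, 1)]
origin = find_origin(points, (2, 2, 1), [(-3, -3), (0, 0)])
```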

The method of arriving at the required division is first to obtain the
quantities

$$a_{1i} = \log f_1(x_i) \qquad a_{2i} = \log f_2(x_i) \qquad a_{3i} = \log f_3(x_i) \qquad i = 1, \ldots, n$$

and plot the points

$$X_i = a_{2i} - a_{1i} \qquad Y_i = a_{3i} - a_{1i}$$

and proceed geometrically as in Figure 5. For this division the prob-
ability density will be a maximum.

The problems treated in 8b fit in with the general decision function 
theory developed by Wald (1949). 


8c Discriminant Function for Selecting 
Genetically Desirable Types 


8c.1 Prediction Formula for the Genotypic Value 
When quantitative characters are involved, it is quite
difficult to select genetically desirable types in breeding work because
heritable differences are to some extent masked by non-heritable or
environmental variations. The problem then arises as to what is the
best indicator of the genotypic value of any individual line. Suppose
the desired quality in the plant is yield. The observed yield is no doubt
a good measure, but, if the factors influencing the yield affect to some
extent other observable characters of the plant, then these latter char-
acters can also be used in assessing the strength of factors responsible
for yield. This can also be looked on as a problem of prediction: How
can the genotypic value with respect to some characteristic be predicted
when measurements on a number of observable characters are available?
This problem can be extended further to cases where the quality of a
line is determined not by a single character but by a given linear com-
pound of the genotypic values corresponding to a number of character-
istics. The coefficients of the linear compound are fixed by the relative
worth of these characters in assessing the quality of the line as a whole.
Thus, in poultry, the annual number of eggs laid ($x_1$), the size of the
egg ($x_2$), and the age at maturity ($x_3$) are some of the important factors
to be considered. To what extent these can be combined in a breed
depends on the genetic relationships among these characters, and in
any breeding program best use should be made of the available material.




If $\psi_1$, $\psi_2$, $\psi_3$ represent the genotypic values of the three characters
mentioned above, the breeder's interest is in the value of a linear com-
pound $a_1\psi_1 + a_2\psi_2 + a_3\psi_3$ which corresponds to the commercial value
of the bird for properly chosen values of the compounding coefficients.
For instance, Panse (1946) considers three sets of weights of which
one is

$$a_1 = 8 \qquad a_2 = 5 \qquad a_3 = -2$$


These weights depend on the relative importance to be attached to each 
of these characters and are assigned by the animal breeder. The cash 
return from each bird depends on the number of eggs laid and also on 
the size of the egg. The age at maturity is also important when the 
cost of feeding the bird in the period from its hatching to the date of 
laying the first egg is considered. For this reason the age at maturity 


is given a negative weight. The highest weight is given to the annual 
number of eggs laid. 


To estimate the above linear compound it may be useful, as observed
earlier, to consider other characters in which the animal breeder is not
directly interested. In the material on poultry analyzed by Panse
(1946) the body weight ($x_4$) is also available and the best predictor
may be written as

$$b_0 + b_1\bar{x}_1 + \cdots + b_4\bar{x}_4$$

where $\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$, and $\bar{x}_4$ are the mean values for a sire. In fact, any
number of extraneous characters can be considered in building up the
prediction formula, and it is not necessary that the formula be linear in
the measurements although linearity introduces a great simplification
in actual computation. To determine these coefficients, Smith (1936)
maximized the regression between the two linear compounds $a_1\psi_1 +
a_2\psi_2 + a_3\psi_3$ and $b_1\bar{x}_1 + b_2\bar{x}_2 + b_3\bar{x}_3 + b_4\bar{x}_4$, which is equivalent to
minimizing the sum of squares with due weights

$$\sum n_r (a_1\psi_{1r} + a_2\psi_{2r} + a_3\psi_{3r} - b_0 - b_1\bar{x}_{1r} - \cdots - b_4\bar{x}_{4r})^2$$

where $n_r$ is the sample size on which the mean values $\bar{x}_{1r}, \ldots, \bar{x}_{4r}$ are
based for the rth sire and $\psi_{1r}$, $\psi_{2r}$, $\psi_{3r}$ are the genotypic values for the
rth sire.

The minimizing equations for $b_1$, $b_2$, $b_3$, and $b_4$ are

$$b_1 B_{11} + b_2 B_{12} + b_3 B_{13} + b_4 B_{14} = a_1 G_{11} + a_2 G_{12} + a_3 G_{13}$$
$$b_1 B_{21} + b_2 B_{22} + b_3 B_{23} + b_4 B_{24} = a_1 G_{21} + a_2 G_{22} + a_3 G_{23}$$
$$b_1 B_{31} + b_2 B_{32} + b_3 B_{33} + b_4 B_{34} = a_1 G_{31} + a_2 G_{32} + a_3 G_{33}$$
$$b_1 B_{41} + b_2 B_{42} + b_3 B_{43} + b_4 B_{44} = a_1 G_{41} + a_2 G_{42} + a_3 G_{43}$$

where $B_{ij}$ is the sum of products between sires and

$$G_{ij} = \sum n_r (\psi_{ir} - \bar{\psi}_i)(\bar{x}_{jr} - \bar{x}_j)$$

$$\bar{\psi}_i = \frac{\sum n_r \psi_{ir}}{\sum n_r} \quad \text{and} \quad \bar{x}_j = \frac{\sum n_r \bar{x}_{jr}}{\sum n_r}$$

The actual values of $G_{ij}$ are not known, but their expected values are
obtained from the equations

$$E(B_{ij}) = E(G_{ij}) + (k - 1)\sigma_{ij}$$

where $\sigma_{ij}$ is the expected covariance between the ith and jth characters
within a sire. If $S_{ij}$ is the sum of products within a sire, then an estimate
of $G_{ij}$ is

$$wG_{ij} \approx wB_{ij} - (k - 1)S_{ij}$$

where w is the degrees of freedom within sires, and (k - 1) the degrees
of freedom between sires.

The analysis of sum of squares and products, the estimated values of
$wG_{ij}$, and other related values are given in Table 8c.1α.

TABLE 8c.1α. Analysis of Dispersion

                                      Between (B_ij)     Within (S_ij)     wB_ij - (k - 1)S_ij
Due to                                k - 1 = 14 D.F.    w = 201 D.F.          = wG_ij
x_1^2                                     6,476.4          67,797.3            352,594.2
x_2^2                                       982.10          2,440.14           163,240.14
x_3^2                                    37,422           317,982            3,070,074
x_4^2                                       335.16          3,660.21            16,124.22
x_1x_2                                      678.16            367.83           131,160.54
x_1x_3                                   -6,043.8         -35,898.6           -712,223.4
x_1x_4                                      750.54           -172.86           153,278.58
x_2x_3                                      825.86            404.01           160,341.72
x_2x_4                                      248.22          1,246.20            32,445.42
x_3x_4                                      708.12         -1,370.82           161,523.60
a_1x_1^2 + a_2x_1x_2 + a_3x_1x_3         67,289.60        616,014.75         4,901,003.10 *
a_1x_1x_2 + a_2x_2^2 + a_3x_2x_3          8,684.06         14,335.32         1,544,801.58 *
a_1x_1x_3 + a_2x_2x_3 + a_3x_3^2       -119,065.10       -921,132.75       -11,036,226.60 *
a_1x_1x_4 + a_2x_2x_4 + a_3x_3x_4         5,829.18          7,589.76         1,065,408.54 *

* These quantities can be obtained in two ways: (i) directly from the $wG_{ij}$ values
and also (ii) from $B_{ij}$ and $S_{ij}$ values of the linear compounds, thus providing
a check. These form the right-hand expressions of the equations for b. Since only
the ratios of b are important, these quantities are divided by 1000 and corrected to
three decimal places to obtain the same order of figures as those in the equations.
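
The third column of the table and the starred right-hand quantities can be checked mechanically. The sketch below assumes $k - 1 = 14$ (fifteen sires), $w = 201$, and a within-sires value of 67,797.3 for $x_1^2$, these being the figures with which the printed third column is reproduced; the names `BETWEEN`, `WITHIN`, `entry`, `wg`, and `rhs` are introduced only for illustration.

```python
W, K_MINUS_1 = 201, 14

# Sums of squares and products, keyed by the pair of character numbers (i <= j).
BETWEEN = {
    (1, 1): 6476.4, (2, 2): 982.10, (3, 3): 37422.0, (4, 4): 335.16,
    (1, 2): 678.16, (1, 3): -6043.8, (1, 4): 750.54,
    (2, 3): 825.86, (2, 4): 248.22, (3, 4): 708.12,
}
WITHIN = {
    (1, 1): 67797.3, (2, 2): 2440.14, (3, 3): 317982.0, (4, 4): 3660.21,
    (1, 2): 367.83, (1, 3): -35898.6, (1, 4): -172.86,
    (2, 3): 404.01, (2, 4): 1246.20, (3, 4): -1370.82,
}
A = (8, 5, -2)  # weights a_1, a_2, a_3 of the straight selection function


def entry(table, i, j):
    """Look up a symmetric table by character numbers in either order."""
    return table[(min(i, j), max(i, j))]


# Estimates of w * G_ij.
wg = {key: W * BETWEEN[key] - K_MINUS_1 * WITHIN[key] for key in BETWEEN}

# Right-hand sides a_1 wG_1j + a_2 wG_2j + a_3 wG_3j for j = 1, ..., 4.
rhs = [sum(a * entry(wg, i, j) for i, a in enumerate(A, start=1)) for j in range(1, 5)]
```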




The equations are

 6476.4b_1 + 678.16b_2 -  6043.8b_3 + 750.54b_4 =    4901.003
 678.16b_1 + 982.10b_2 +  825.86b_3 + 248.22b_4 =    1544.802
-6043.8b_1 + 825.86b_2 +  37,422b_3 + 708.12b_4 = -11,036.227
 750.54b_1 + 248.22b_2 +  708.12b_3 + 335.16b_4 =    1065.409

yielding the solutions

b_1 = -0.16274     b_2 = 1.12795     b_3 = -0.41387     b_4 = 3.58233


Strangely, the character which is scored quite high has a negative co-
efficient in the discriminant function, which leads us to suspect the re-
liability of the material. If, however, body weight, which has a suspi-
ciously low variance, is omitted from consideration, the new weights are

b_1 = 0.33484     b_2 = 1.57346     b_3 = -0.27556

leading to the discriminant function

$$0.33484\bar{x}_1 + 1.57346\bar{x}_2 - 0.27556\bar{x}_3$$

against a straight selection function

$$8\bar{x}_1 + 5\bar{x}_2 - 2\bar{x}_3$$
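
Both sets of weights can be recovered by solving the equations numerically, first with all four characters and then with body weight omitted. The elimination routine `solve` and the variable names below are assumptions made for this sketch; agreement with the printed solutions is expected to about four decimals because the coefficients are rounded.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Sums of products between sires for x_1, x_2, x_3, x_4.
B = [
    [6476.4, 678.16, -6043.8, 750.54],
    [678.16, 982.10, 825.86, 248.22],
    [-6043.8, 825.86, 37422.0, 708.12],
    [750.54, 248.22, 708.12, 335.16],
]
RHS = [4901.003, 1544.802, -11036.227, 1065.409]

b_four = solve(B, RHS)                                # all four characters
b_three = solve([row[:3] for row in B[:3]], RHS[:3])  # body weight omitted
```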


er the genotypic value ayy ++ ao¥2 
cen the sires. An answer to this is 
near compound artı + agra + asta 

shown in Table 8¢.2e. The dis- 
S a better estimate of the genotypic 
S unnecessary to test whether the 
timinates between the sires. What 


of its significance. 


8c.2 The Genetic Advance 

Let y be a characteristic as measured by a character x with mean $\mu$
and variance $\sigma^2$. Let the regression of y on x be $\beta$; then the expected
value of y for a given x is

$$\bar{y} + \beta(x - \mu)$$

where $\bar{y} = E(y)$ for all x. If x is normally distributed, then the expected
value of y for all x exceeding the upper qth part is

$$\frac{1}{q}\,\frac{1}{\sqrt{2\pi}\,\sigma}\int_{x_q}^{\infty}[\bar{y} + \beta(x - \mu)]\,e^{-(x-\mu)^2/2\sigma^2}\,dx = \bar{y} + \frac{\beta\sigma z}{q}$$

where z is the ordinate to the normal curve at $x_q$, the abscissa corre-
sponding to the upper qth part of the normal curve.

Suppose that a large number of plant lines are available and a qth
part of them is chosen for further propagation by the above method.
The genetic advance then is

$$\frac{\beta\sigma z}{q}$$

and, since z/q is common for all selection procedures, the intensity of
genetic advance depends on $\beta\sigma$, which is equivalent to cov $(y, x)/\sigma$.
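
The factor z/q can be evaluated for any selected fraction q. In the sketch below z is taken as the ordinate of the standard normal curve at the standardized abscissa cutting off the upper qth part, which is the reading under which the advance equals $\beta\sigma z/q$; the function names and the trial values of $\beta$ and $\sigma$ are assumptions.

```python
from statistics import NormalDist


def selection_intensity(q):
    """Return z/q for selection of the upper fraction q of a normal population."""
    std = NormalDist()              # standard normal curve
    x_q = std.inv_cdf(1.0 - q)      # abscissa cutting off the upper qth part
    return std.pdf(x_q) / q


def genetic_advance(beta, sigma, q):
    """Expected gain in y when the upper fraction q is selected on x."""
    return beta * sigma * selection_intensity(q)


# Hypothetical regression 0.5 and standard deviation 2.0, selecting the top tenth.
advance = genetic_advance(0.5, 2.0, 0.10)
```

A smaller selected fraction raises z/q, but for a fixed q the comparison between selection functions rests on $\beta\sigma$ alone.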

For selection of poultry the values of the discriminant function at 
the mean values are calculated and the sires corresponding to the highest 
values are chosen. It is seen that, if the means are based on a large 
number of observations corresponding to each sire, then the prediction 
of the genotypic value is more accurate and consequently the genetic 
advance is higher, the maximum genetic advance being available when 
the mean values are known exactly. So in the problem of evaluation 
of the genetic advance it is required to know how much experimentation 
is contemplated to assess the value of each sire. Suppose that the mean 
values for each sire are based on a sample of size n. The genetic ad-
vance associated with any linear compound

$$c_1\bar{x}_1 + c_2\bar{x}_2 + c_3\bar{x}_3$$

depends on

$$\frac{\text{cov}\,(a_1\psi_1 + a_2\psi_2 + a_3\psi_3,\ c_1\bar{x}_1 + c_2\bar{x}_2 + c_3\bar{x}_3)}{\sqrt{V(c_1\bar{x}_1 + c_2\bar{x}_2 + c_3\bar{x}_3)}}$$

The numerator has the expected value

$$c_1(a_1 g_{11} + a_2 g_{12} + a_3 g_{13}) + c_2(a_1 g_{12} + a_2 g_{22} + a_3 g_{23}) + c_3(a_1 g_{13} + a_2 g_{23} + a_3 g_{33})$$

and the square of the denominator $V(c_1\bar{x}_1 + \cdots + c_3\bar{x}_3)$ is

$$\sum\sum c_i c_j g_{ij} + \frac{\sum\sum c_i c_j \sigma_{ij}}{n}$$

where $\sigma_{ij}$ is the expected covariance between the ith and jth characters
within the sires and $g_{ij}$ is the covariance between the ith and jth geno-
typic values between the sires. The estimated values of $g_{ij}$ and $\sigma_{ij}$ are
substituted in the above expressions. These estimates are obtained by
a direct analysis of variance and covariance of the linear compound
$c_1x_1 + c_2x_2 + c_3x_3$ and the straight selection function $a_1x_1 + a_2x_2
+ a_3x_3$.

The mean square between sires ($S_1$) for any linear compound $c_1x_1
+ c_2x_2 + c_3x_3$ has the expectation

$$\lambda\sum\sum c_i c_j g_{ij} + \sum\sum c_i c_j \sigma_{ij}$$

where

$$\lambda = \frac{\sum n_r - \sum n_r^2 / \sum n_r}{k - 1}$$

In the above expression, $n_1, n_2, \ldots$ are the numbers of observations on the first,
second, ... sires. The mean square within sires ($S_2$) has the expecta-
tion $\sum\sum c_i c_j \sigma_{ij}$, so that

$$\sum\sum c_i c_j g_{ij} \approx \frac{S_1 - S_2}{\lambda}$$

Hence the variance of the mean of the linear compound based on n
observations is

$$V(c_1\bar{x}_1 + c_2\bar{x}_2 + c_3\bar{x}_3) \approx \frac{S_1 - S_2}{\lambda} + \frac{\sum\sum c_i c_j \sigma_{ij}}{n}$$

Similarly,

$$\text{cov}\,(a_1\psi_1 + a_2\psi_2 + a_3\psi_3,\ c_1\bar{x}_1 + c_2\bar{x}_2 + c_3\bar{x}_3) \approx \frac{D_1 - D_2}{\lambda}$$

where $D_1$ and $D_2$ are the mean sums of products between and within
sires for the two linear compounds, $a_1x_1 + a_2x_2 + a_3x_3$ and $c_1x_1 + c_2x_2
+ c_3x_3$.

We shall illustrate the method for the linear compound determined
above. The analysis of variance and covariance for the calculated
function $b_1x_1 + b_2x_2 + b_3x_3 = Y_1$, and the straight selection function
$a_1x_1 + a_2x_2 + a_3x_3 = Y_2$ is shown in Table 8c.2α.

THE GENETIC ADVANCE 335 


TABLE 8c.2a. Analysis of Variance and Covariance of Y1 and Y2

                        |         Y1²            |          Y2²             |         Y1Y2
                 D.F.   |    S.S.       M.S.     |     S.S.        M.S.     |    S.P.       M.P.
Between sires     14    |  7,112.875   508.0625  |   819,867.3    58,561.9  |  69,004.85   4928.92
Within sires     201    | 44,449.820   221.1433  | 6,842,060.1    34,040.1  | 482,649.77   2401.24
Difference              |              286.9192  |                24,521.8  |              2527.68
Difference / λ          |               20.791   |                1,776.948 |              183.166
Within sires / n        |               15.357   |                2,363.881 |              166.752
Total                   |               36.148   |                4,140.829 |              349.918

(i) The values are obtained by the following formulae.

7,112.875 = $b_1$(4901.003) + $b_2$(1544.801) + $b_3$(−11,036.227) (the values used in the equations for $b$ from Table 8c.1a)

44,449.820 = $\Sigma\Sigma b_ib_jS_{ij}$

819,867.38 = $a_1$(67,289.60) + $a_2$(8684.06) + $a_3$(−119,065.10)

69,004.85 = $b_1$( ) + $b_2$( ) + $b_3$( )

6,842,060.1 = $a_1$(616,014.75) + $a_2$(14,335.32) + $a_3$(−921,132.75)

482,649.77 = $b_1$( ) + $b_2$( ) + $b_3$( )

(ii) The ratio of mean squares for $Y_2$ is 58,561.9/34,040.1 = 1.72, which is on the 5% significant level for 14 and 201 degrees of freedom. Only when this is significant can we proceed to estimate the genotypic value.


The actual values of $n_1, n_2, \ldots, n_{15}$ available for $k = 15$ sires are not known in the above experiment. Let us suppose that the values are such that

$$\lambda = \frac{\Sigma n_r - \Sigma n_r^2/\Sigma n_r}{k - 1} = 13.8$$

and let us find the genetic advance for the value of $n = 14.4$, which is the average size in the above example. As observed earlier, we can calculate the genetic advance for any $n$, depending on the intensity of experimentation.

The index of genetic advance for straight selection associated with $a_1\bar{x}_1 + a_2\bar{x}_2 + a_3\bar{x}_3$ is

$$\frac{\text{Difference}/\lambda}{\sqrt{\text{Difference}/\lambda + \text{Within}/n}} = \frac{1776.948}{\sqrt{4140.83}} = 27.614$$

The numerator and denominator are both obtained from the column under $Y_2^2$ in Table 8c.2a. The expression for $b_1\bar{x}_1 + b_2\bar{x}_2 + b_3\bar{x}_3$ is

$$\frac{\text{Difference}/\lambda \ \text{for } Y_1Y_2}{\sqrt{(\text{Difference}/\lambda + \text{Within}/n) \ \text{for } Y_1^2}} = \frac{183.166}{\sqrt{36.148}} = 30.467$$

so that the genetic advance is $(30.467/27.614 - 1)\,100 = 10.33\%$ higher for the discriminant function. It is difficult to test how far this observed increase is significant.
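The arithmetic of this illustration can be retraced with a short sketch. It assumes the mean squares and mean products of Table 8c.2a, the divisor 13.8 implied by the tabulated entries, and n = 14.4; the function and variable names are hypothetical and chosen only for this example.

```python
import math

LAMBDA = 13.8   # (sum n_r - sum n_r^2 / sum n_r) / (k - 1), as implied by the table
N_MEAN = 14.4   # number of observations per sire assumed for the means

# (between-sires, within-sires) mean squares or mean products from Table 8c.2a
Y1Y1 = (508.0625, 221.1433)
Y2Y2 = (58561.9, 34040.1)
Y1Y2 = (4928.92, 2401.24)

def between_component(ms):
    """Estimated genotypic (between-sire) component: Difference / lambda."""
    between, within = ms
    return (between - within) / LAMBDA

def variance_of_mean(ms, n):
    """Estimated variance of a sire mean based on n observations."""
    return between_component(ms) + ms[1] / n

def advance_index(cov_ms, var_ms, n):
    """Covariance with the genotypic value divided by the s.d. of the sire mean."""
    return between_component(cov_ms) / math.sqrt(variance_of_mean(var_ms, n))

straight = advance_index(Y2Y2, Y2Y2, N_MEAN)      # straight selection, Y2
discriminant = advance_index(Y1Y2, Y1Y1, N_MEAN)  # discriminant function, Y1
gain_percent = (discriminant / straight - 1.0) * 100.0
print(round(straight, 2), round(discriminant, 2), round(gain_percent, 1))
```

Small discrepancies in the last decimal place against the printed figures are to be expected, since the divisor 13.8 is itself a rounded value.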


8d Problems of Optimum Selection 
8d.1 A Single Predictor for Dichotomy 


Birnbaum and Chapman (1950) considered a problem of selecting candidates on the basis of $p$ admission scores $y_1, \ldots, y_p$. The object is to select those whose performance is expected to be better in the final test. The offered solution does not refer to a case where the scores of a number of individuals $N$ have been observed but to a hypothetical set of individuals applying for the admission test. The former problem is often met with because the question asked is who out of a number of individuals whose admission scores are available should be admitted. Let the scores of $N$ individuals be represented by

$$\begin{matrix} y_{11}, & y_{21}, & \ldots, & y_{p1} \\ \cdot & \cdot & \ldots & \cdot \\ y_{1N}, & y_{2N}, & \ldots, & y_{pN} \end{matrix}$$

To answer this problem we need to know the expected performance in the final test of an individual with the admission scores $y_{1i}, \ldots, y_{pi}$. Let this expected performance be

$$x_i = \phi(y_{1i}, \ldots, y_{pi})$$

which actually stands for the regression equation of the final performance on the initial scores. The regression function, which may be of any complicated type, supplies us with the expected performances $x_1, \ldots, x_N$ of the candidates, and these latter scores form the basis for selection. The regression function can be estimated on the basis of the previous information.




For instance, if a given number of k seats are available, then the best 
plan is to admit k candidates corresponding to the k largest values of x 
because this maximizes the expected performance under the condition 
that k have to be chosen. 

A second alternative may be to admit as many as possible with the restriction that the expected average performance is not less than an assigned number $x_0$. The best plan is then to order the $x$ scores in a decreasing order and find the cumulative averages from the top and admit all those for whom the cumulative average is greater than or equal to $x_0$. Obviously under such a selection procedure the maximum number is admitted subject to the condition that the expected average performance of the selected candidates is not less than $x_0$.

If the restriction is that the average performance of the chosen candidates should exceed a given value $x_0$ with a probability greater than $\beta$, then again we start with the highest score of $x$ and go on adding the others in the decreasing order of $x$ till the required probability remains greater than $\beta$. The calculations are not simple, however.
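The first two rules for a fixed list of candidates (a given number of seats, and an assigned minimum average) can be sketched as follows. The scores are invented for illustration and the function names are hypothetical; the third, probabilistic rule is not attempted here.

```python
def select_top_k(x, k):
    """Indices of the k candidates with the largest expected scores."""
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    return order[:k]

def select_by_average(x, x0):
    """Admit from the top while the cumulative average stays >= x0.

    The cumulative averages of scores sorted in decreasing order never
    increase, so the first failure ends the admissions.
    """
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        if (total + x[i]) / (len(chosen) + 1) < x0:
            break
        total += x[i]
        chosen.append(i)
    return chosen

scores = [62, 48, 75, 55, 70, 40]       # illustrative expected performances
print(select_top_k(scores, 2))          # two seats available
print(select_by_average(scores, 65))    # average must not fall below 65
```

With these scores the cumulative averages from the top are 75, 72.5, 69.7, 65.5 and 62, so four candidates are admitted under the second rule.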

If we consider a hypothetical set of candidates, a situation that may arise when the statistician is asked to give a uniform rule for independent recruitment at various places without specifying the numbers to be selected from each place, then what is needed is the determination of a critical value $x_a$, leading to the selection of all individuals with the expected $x$ score (calculated on the basis of the admission $y$ score) greater than or equal to $x_a$. For this the distribution of the expected score $x$ as a function of $y$ has to be studied. Let this be $f(x)$. If the criterion is that the maximum number of candidates has to be admitted subject to the condition that the expected average performance is not less than $x_0$, then $x_a$ is determined from the formula

$$\int_{x_a}^{\infty} x f(x)\,dx = x_0\int_{x_a}^{\infty} f(x)\,dx$$

in which case the expected proportion admitted is

$$\int_{x_a}^{\infty} f(x)\,dx$$
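As an illustration of the critical value, the sketch below solves the defining equation numerically under the added assumption that the expected score is normally distributed with mean `mu` and standard deviation `sigma`. The text does not specify the distribution; the normal form, the bisection search, and all names here are assumptions made for the example.

```python
import math

def pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def upper_tail(t):
    """P(Z >= t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def mean_above(cutoff, mu, sigma):
    """Average expected score of those at or above the cutoff."""
    t = (cutoff - mu) / sigma
    return mu + sigma * pdf(t) / upper_tail(t)

def critical_value(x0, mu, sigma, tol=1e-10):
    """Smallest cutoff whose admitted group averages at least x0 (bisection)."""
    if x0 <= mu:
        return float("-inf")          # everyone can be admitted
    lo, hi = mu - 8.0 * sigma, mu + 8.0 * sigma   # x0 assumed within this range
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_above(mid, mu, sigma) < x0:
            lo = mid
        else:
            hi = mid
    return hi

x_a = critical_value(60.0, 50.0, 10.0)
proportion = upper_tail((x_a - 50.0) / 10.0)
print(round(x_a, 2), round(proportion, 3))
```

The search relies on the average of the admitted group increasing with the cutoff, which holds for the normal distribution.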
8d.2 The Problem of Differential Predictors 


In the problem treated in 8b.10 it was assumed that the individuals belong to separate groups characterized by distinct distributions. What was needed there was the classification of a collection of individuals into the distinct groups to which they belong. But situations arise where an individual cannot be said to belong to a distinct group, as when we have to judge the relative usefulness of a person in two jobs, A and B. To give another instance, it may be necessary to determine on the basis of a student's score in an admission test whether he should be allowed to take a course in mathematics or physics. What we need in such cases is a set of predictors measuring the success of a candidate in various careers on the basis of the initial scores, from which a profitable course of action can be decided. The general problem may be stated as follows.

Out of $N$ applicants, $n_1$ have to be selected for the first job, $n_2$ for the second, ..., $n_k$ for the kth, given their scores in some suitably designed tests. If the number of applications exceeds the number of jobs to be filled, then the rest of the applications, $n_{k+1}$, have to be rejected. This problem arising out of a study by Brogden (1946) admits a neat solution by the use of lemmas in A5.

Let $x_1, \ldots, x_p$ be the scores for $p$ items of a test used to predict the success of a candidate in $k$ careers. For a proper treatment of the problem it is necessary that success should be measurable quantitatively, in which case the success in two different jobs could be compared. For instance it may be possible to predict (if past records are available) on the basis of initial scores or to find out by direct practical tests, if possible, how much worth of goods a person can produce in various types of jobs. If it is the admission of a student into one of various alternative courses, the success may be measured by the number of marks (properly standardized) the student is expected to secure at the end of the course on the basis of the initial score. These quantities measuring a person's success in $k$ given situations are represented by $s_1(x), s_2(x), \ldots, s_k(x)$, which are necessarily functions of the initial scores $x_1, \ldots, x_p$. Let there be $N$ applicants whose success scores are given below, with a zero column representing the success score when the applicant is not selected for any job.

$$\begin{matrix} s_{11}, & s_{21}, & \ldots, & s_{k1}, & 0 \\ \cdot & \cdot & \ldots & \cdot & \cdot \\ s_{1N}, & s_{2N}, & \ldots, & s_{kN}, & 0 \end{matrix}$$

Let us choose $n_1$ values from the first column, $n_2$ from the second, ..., and $n_{k+1}$ from the last, such that the sum of all these values is a maximum. The method of determining these values is given in A5. The candidates corresponding to the $n_i$ values chosen from the ith column are selected for the ith job. The candidates corresponding to $n_{k+1}$ zero values from the last column are not chosen for any job.
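For a handful of applicants the allocation can be found by direct enumeration, which makes the statement of the problem concrete. The sketch below is a brute-force search, not the multiplier method of A5; the success scores, the quotas, and the function name are invented for illustration.

```python
from itertools import permutations

def best_allocation(scores, quotas):
    """Exhaustive search for the allocation with the largest total success.

    scores[r][j] is the predicted success of applicant r in job j;
    quotas[j] is the number of places in job j. Applicants left over
    receive the label len(quotas) (rejected) and contribute zero.
    """
    n, k = len(scores), len(quotas)
    labels = [j for j in range(k) for _ in range(quotas[j])]
    labels += [k] * (n - len(labels))
    best_total, best_labels = None, None
    for assignment in set(permutations(labels)):
        total = sum(scores[r][j] for r, j in enumerate(assignment) if j < k)
        if best_total is None or total > best_total:
            best_total, best_labels = total, assignment
    return best_total, best_labels

success = [[8, 6], [7, 9], [5, 4], [9, 3], [6, 7]]   # 5 applicants, 2 jobs
total, allocation = best_allocation(success, quotas=[2, 1])
print(total, allocation)
```

Enumeration grows factorially with the number of applicants, so it serves only as a check on small cases.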




Appendix A 


A1 A Lemma of Neyman and Pearson 
Let $F_0, F_1, \ldots, F_m$ be a set of integrable functions defined in the whole space of $(x_1, \ldots, x_n)$ and $w$ any region such that

$$\int_w F_i\,dv = C_i, \quad i = 1, 2, \ldots, m \tag{A1.1}$$

where $dv$ stands for the volume element $dx_1 \cdots dx_n$ and $C_i$ are assigned constants. Let $w_0$ be a region within which

$$F_0 \ge k_1F_1 + \cdots + k_mF_m \tag{A1.2}$$

and outside which

$$F_0 \le k_1F_1 + \cdots + k_mF_m \tag{A1.3}$$

where $k_i$ are determined such that $w_0$ satisfies (A1.1).

The lemma states that for any region $w$ satisfying (A1.1) the following relationship holds.

$$\int_w F_0\,dv \le \int_{w_0} F_0\,dv$$

Let the common part of the regions $w$ and $w_0$ be denoted by $ww_0$. The region $w - ww_0$ is the part of $w$ not common to $w_0$. From condition (A1.1) it follows that

$$\int_{w - ww_0} F_i\,dv = \int_{w_0 - ww_0} F_i\,dv \tag{A1.4}$$

Consider the difference

$$\begin{aligned}
d &= \int_{w_0} F_0\,dv - \int_w F_0\,dv = \int_{w_0 - ww_0} F_0\,dv - \int_{w - ww_0} F_0\,dv \\
&\ge \int_{w_0 - ww_0} (\Sigma k_iF_i)\,dv - \int_{w - ww_0} (\Sigma k_iF_i)\,dv \\
&= 0 \quad \text{due to (A1.4)}
\end{aligned}$$

Hence the lemma is proved.

A2 A Generalization of the Neyman-Pearson Lemma

(i) Let $R_1', R_2', \ldots$ be a set of mutually exclusive regions covering the whole space such that with respect to the integrable functions $g_1, g_2, \ldots$ the values

$$\int_{R_i'} g_j\,dv = s_{ji} \tag{A2.1}$$

are constant.

(ii) Consider the system of regions ($\cap$ = defined by)

$$R_k \cap F_k \le F_s, \quad s = 1, 2, \ldots; \quad k = 1, 2, \ldots \tag{A2.2}$$

where

$$F_r = \phi_r + \mu_{r1}g_1 + \mu_{r2}g_2 + \cdots$$

$\phi_r$ being some assigned functions. We now prove that the value of the integral

$$\int_{R_1} \phi_1\,dv + \int_{R_2} \phi_2\,dv + \cdots$$

subject to the conditions in (A2.1) is a minimum for the set of regions defined in (A2.2).

The intersection of the regions $R_i$ and $R_j'$ is represented by $R_{ij}$. It follows from definition that

$$\int_{R_i} F_i\,dv \le \int_{R_{i1}} F_1\,dv + \int_{R_{i2}} F_2\,dv + \cdots$$

Writing down the above relationship for all $R_i$ and adding, we obtain

$$\int_{R_1} F_1\,dv + \int_{R_2} F_2\,dv + \cdots \le \int_{R_1'} F_1\,dv + \int_{R_2'} F_2\,dv + \cdots$$

or

$$\int_{R_1} \phi_1\,dv + \int_{R_2} \phi_2\,dv + \cdots + \Sigma\Sigma\mu_{ij}s_{ji} \le \int_{R_1'} \phi_1\,dv + \int_{R_2'} \phi_2\,dv + \cdots + \Sigma\Sigma\mu_{ij}s_{ji}$$

because of the conditions in (A2.1). The above result is established.

Suppose that the sum

$$\int_{R_1} \phi_1\,dv + \int_{R_2} \phi_2\,dv + \cdots$$

has to be maximized. Then the regions are

$$R_1 \cap F_1 \ge F_2,\; F_1 \ge F_3,\; \ldots$$
$$R_2 \cap F_2 \ge F_1,\; F_2 \ge F_3,\; \ldots$$

Suppose that no conditions such as (A2.1) are specified. Then the regions are

$$R_1 \cap \phi_1 \ge \phi_2,\; \phi_1 \ge \phi_3,\; \ldots$$
$$R_2 \cap \phi_2 \ge \phi_1,\; \phi_2 \ge \phi_3,\; \ldots$$

A3 A Slight Variation of Lemma A1 


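Let $F_0, F_1, \ldots, F_m$ be a set of integrable functions such that, with respect to a positive function $p(x) \le 1$,

$$\int_S F_i\,p(x)\,dv = C_i, \quad i = 1, \ldots, m \tag{A3.1}$$

over the whole space $S$.

Consider the special form of the function $p(x)$

$$p(x) = \begin{cases} 0 & \text{when } F_0 < k_1F_1 + \cdots + k_mF_m \\ 1 & \text{when } F_0 \ge k_1F_1 + \cdots + k_mF_m \end{cases} \tag{A3.2}$$

where $k_1, \ldots, k_m$ are determined to satisfy the condition (A3.1). Out of all $p(x)$ the integral

$$\int_S F_0\,p(x)\,dv$$

is a maximum for the special form chosen in (A3.2).

Let $D$ be a region inside which $F_0 \ge k_1F_1 + \cdots + k_mF_m$ and outside which the reverse relation holds. For any general $p(x)$

$$\begin{aligned}
\int_S F_0\,p(x)\,dv &= \int_D F_0\,p(x)\,dv + \int_{S-D} F_0\,p(x)\,dv \\
&\le \int_D F_0\,p(x)\,dv + \int_{S-D} (\Sigma k_iF_i)\,p(x)\,dv \\
&= \int_D F_0\,p(x)\,dv + \int_S (\Sigma k_iF_i)\,p(x)\,dv - \int_D (\Sigma k_iF_i)\,p(x)\,dv \\
&= \int_D F_0\,p(x)\,dv + \int_D (\Sigma k_iF_i)[1 - p(x)]\,dv \\
&\le \int_D F_0\,p(x)\,dv + \int_D F_0[1 - p(x)]\,dv = \int_D F_0\,dv
\end{aligned}$$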




A4 A Lemma on Power Functions 


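Let $f_1, f_2, \ldots$ be a finite number of probability densities alternative to $f_0$ which is specified by the null hypothesis. Let $w$ be any region satisfying the conditions

$$\int_w f_0\,dv = \alpha \tag{A4.1}$$

and

$$\frac{1}{a_1}\int_w f_1\,dv = \frac{1}{a_2}\int_w f_2\,dv = \cdots \tag{A4.2}$$

where $a_1, a_2, \ldots$ are positive assigned quantities.

Out of all regions satisfying the conditions (A4.1) and (A4.2), the region $w_0$

Inside which $f_0 \le \lambda_1f_1 + \lambda_2f_2 + \cdots$
Outside which $f_0 \ge \lambda_1f_1 + \lambda_2f_2 + \cdots$

where $\lambda_1, \lambda_2, \ldots$ are determined such that the above conditions are satisfied, gives the highest common value to the quantities in (A4.2).

Proof. Let $\beta$ and $\beta_0$ be the common values (A4.2) associated with the regions $w$ and $w_0$, and denote by $ww_0$ the region common to $w$ and $w_0$. Then we have

$$\begin{aligned}
(\lambda_1a_1 + \lambda_2a_2 + \cdots)\beta_0 &= \int_{w_0} (\lambda_1f_1 + \lambda_2f_2 + \cdots)\,dv \\
&\ge \int_{w_0 - ww_0} f_0\,dv + \int_{ww_0} (\lambda_1f_1 + \lambda_2f_2 + \cdots)\,dv \\
&= \int_{w - ww_0} f_0\,dv + \int_{ww_0} (\lambda_1f_1 + \lambda_2f_2 + \cdots)\,dv \\
&\ge \int_{w - ww_0} (\lambda_1f_1 + \lambda_2f_2 + \cdots)\,dv + \int_{ww_0} (\lambda_1f_1 + \lambda_2f_2 + \cdots)\,dv \\
&= \int_w (\lambda_1f_1 + \lambda_2f_2 + \cdots)\,dv \\
&= (\lambda_1a_1 + \lambda_2a_2 + \cdots)\beta
\end{aligned}$$

If $(\lambda_1a_1 + \lambda_2a_2 + \cdots)$ is positive, then $\beta_0 \ge \beta$. To prove that $(\lambda_1a_1 + \lambda_2a_2 + \cdots)$ is positive we observe that

$$\int_{w_0} (\lambda_1f_1 + \lambda_2f_2 + \cdots)\,dv \ge \int_{w_0} f_0\,dv$$

i.e.,

$$(\lambda_1a_1 + \lambda_2a_2 + \cdots)\beta_0 \ge \alpha$$

Since $\beta_0$ and $\alpha$ are positive, it follows that $(\lambda_1a_1 + \lambda_2a_2 + \cdots)$ is necessarily positive. The lemma is proved.

This lemma gives us a method of determining a region with respect to which the powers of the various alternative hypotheses are in assigned ratios, and, subject to this condition, every alternative hypothesis has the maximum power.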


A5 Two Lemmas Useful in Classificatory Problems 
Consider an array of elements

$$\begin{matrix} a_{11} & a_{21} & \cdots & a_{p1} \\ a_{12} & a_{22} & \cdots & a_{p2} \\ \cdot & \cdot & \cdots & \cdot \\ a_{1n} & a_{2n} & \cdots & a_{pn} \end{matrix}$$

consisting of $p$ columns and $n$ rows. Let $P$ denote the product and $S$ the sum of $n$ elements chosen one from each row such that the total number of elements coming from the first column is equal to a specified non-zero value $n_1$, from the second $n_2$, and so on from the pth column $n_p$. Obviously $n \ge p$ and $n = n_1 + \cdots + n_p$.

Lemma 1. If the elements $a_{ij}$ are not negative and if there exist quantities $\lambda_1, \ldots, \lambda_p$ such that each element $a_{ik}$ of the $n_i$ elements chosen from the ith column satisfies the relationships

$$\lambda_ia_{ik} \ge \lambda_ja_{jk}, \quad j = 1, \ldots, p \tag{A5.1}$$

and if similar relationships are satisfied for all $i = 1, \ldots, p$ with the same set $\lambda_1, \ldots, \lambda_p$, then for this choice of $n_1, \ldots, n_p$ elements the product $P$ defined above is a maximum.

To prove this consider any other choice of $n_i$ elements

$$a_{im_1},\; a_{im_2},\; \ldots$$

from the ith column. If we remember that the second subscript refers to the row number, the elements from the $m_1$th, $m_2$th, ... rows occurring in the proposed selection may be represented by

$$a_{j_1m_1},\; a_{j_2m_2},\; \ldots$$

from which by definition it follows that the product

$$a_{j_1m_1}\,a_{j_2m_2} \cdots$$

is not less than

$$\frac{\lambda_i}{\lambda_{j_1}}a_{im_1}\,\frac{\lambda_i}{\lambda_{j_2}}a_{im_2} \cdots$$

Since each $\lambda$ is positive by definition, division by $\lambda$ does not change the inequality sign. Considering all the groups of elements in the second selection we find

$$\prod_i a_{j_1m_1}a_{j_2m_2} \cdots \;\ge\; \prod_i \frac{\lambda_i}{\lambda_{j_1}}\frac{\lambda_i}{\lambda_{j_2}} \cdots a_{im_1}a_{im_2} \cdots \;=\; \prod_i a_{im_1}a_{im_2} \cdots$$

since

$$\prod_i \frac{\lambda_i^{n_i}}{\lambda_{j_1}\lambda_{j_2} \cdots} = 1$$

the terms in the numerator canceling with those in the denominator in the final product.


Corollary 1.1. If the object is to maximize the product $P$ without the restriction on the number of elements coming from each column, then obviously the best procedure is to choose the biggest element from each row.

Corollary 1.2. The product $P$ will be a minimum if the inequality relationship (A5.1) is reversed.

Corollary 1.3. If $p = 2$, the method described in lemma 1 reduces to evaluating the ratios

$$\frac{a_{11}}{a_{21}},\; \frac{a_{12}}{a_{22}},\; \ldots,\; \frac{a_{1n}}{a_{2n}}$$

and arranging them in descending order of magnitude and choosing the elements in the numerator from the first $n_1$ ratios and the elements in the denominator from the second $n_2$ ratios.
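The two-column rule of Corollary 1.3 is easy to check against exhaustive enumeration on a small array. In the sketch below the array is invented, the second column is assumed strictly positive so that the ratios exist, and the function names are hypothetical.

```python
from itertools import combinations
from math import prod

def ratio_rule(col1, col2, n1):
    """Rows whose element is taken from column 1: the n1 largest ratios."""
    rows = sorted(range(len(col1)), key=lambda k: col1[k] / col2[k], reverse=True)
    return set(rows[:n1])

def product_for(col1, col2, from_col1):
    """Product of one element per row, column 1 for rows in from_col1."""
    return prod(col1[k] if k in from_col1 else col2[k] for k in range(len(col1)))

col1 = [4.0, 2.0, 9.0, 3.0, 5.0]
col2 = [2.0, 6.0, 3.0, 4.0, 5.0]
n1 = 2                                   # elements required from column 1

chosen = ratio_rule(col1, col2, n1)
by_rule = product_for(col1, col2, chosen)
by_search = max(product_for(col1, col2, set(c))
                for c in combinations(range(len(col1)), n1))
print(sorted(chosen), by_rule, by_search)
```

The agreement follows because the product equals the product of the second column multiplied by the ratios of the rows taken from the first column.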


Lemma 2. If there exist quantities $\mu_1, \ldots, \mu_p$ such that each element $a_{ik}$ of the $n_i$ elements chosen from the ith column satisfies the relationships

$$a_{ik} + \mu_i \ge a_{jk} + \mu_j, \quad j = 1, \ldots, p$$

and if similar relationships hold for all $i = 1, \ldots, p$ with the same set $\mu_1, \ldots, \mu_p$, then for this choice of $n_1, \ldots, n_p$ elements the sum $S$ defined above is a maximum.

The result follows from lemma 1 by considering exp $(a_{ik})$ and maximizing the product. The existence of $\mu_1, \ldots, \mu_p$ leads to the existence of positive quantities $\lambda_1, \ldots, \lambda_p$ used in lemma 1. The sum $S$ is minimized when the reverse relationships hold good.


Appendix B 


B1 On a Transformation Useful in Multivariate Computations

In multivariate analysis one is often confronted with the task of in- 
verting a covariance matrix which is laborious when the number of 
variates exceeds four or five. This and the further use of the elements 
of the inverse matrix in the computation of statistical constants and 
test criteria can be considerably simplified by working with a set of 
transformed variates derivable from the original variates. The method 
of construction of these transformed variates and the mechanization it 
introduces on the computational side are given in this appendix with 
special reference to the statistical methods used in Chapters 8 and 9. 

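Let $x_1, \ldots, x_p$ be the original variables, and $\lambda_{ij}$ the covariance between the ith and jth variates. The transformed variables $Y_1, Y_2, \ldots$ are defined by

$$\begin{aligned}
Y_1 &= x_1 \\
Y_2 &= x_2 - a_{21}Y_1 \\
Y_3 &= x_3 - a_{32}Y_2 - a_{31}Y_1 \\
&\;\;\vdots \\
Y_p &= x_p - a_{p,p-1}Y_{p-1} - \cdots - a_{p1}Y_1
\end{aligned} \tag{B1.1}$$

The constants $a_{ij}$ are chosen such that $Y_i$ are independent. The actual evaluation of these coefficients is carried out in successive stages so that, if the coefficients in $Y_1, \ldots, Y_i$ are known, any coefficient in $Y_{i+1}$ can be calculated in a simple manner.

To find $a_{21}$, the covariance of $Y_1$, $Y_2$ denoted by cov $(Y_1Y_2)$ has to be equated to zero.

$$\operatorname{cov}(Y_1Y_2) = \operatorname{cov}(x_1x_2) - a_{21}V(Y_1) = 0$$
$$\lambda_{21} - a_{21}\lambda_{11} = 0$$
$$V(Y_2) = \lambda_{22} - \lambda_{21}a_{21}$$

where $V$ denotes variance. For $Y_3$, $a_{31}$ and $a_{32}$ are to be calculated in order. With the constants $b_{ij}$ as defined below introduced merely to facilitate computation, the steps may be given as follows

$$b_{31} = \lambda_{31}, \qquad a_{31} = \frac{b_{31}}{V(Y_1)}$$
$$b_{32} = \lambda_{32} - a_{21}b_{31}, \qquad a_{32} = \frac{b_{32}}{V(Y_2)}$$
$$V(Y_3) = \lambda_{33} - b_{31}a_{31} - b_{32}a_{32}$$

To find $Y_4$, the steps are

$$b_{41} = \lambda_{41}, \qquad a_{41} = \frac{b_{41}}{V(Y_1)}$$
$$b_{42} = \lambda_{42} - a_{21}b_{41}, \qquad a_{42} = \frac{b_{42}}{V(Y_2)}$$
$$b_{43} = \lambda_{43} - a_{32}b_{42} - a_{31}b_{41}, \qquad a_{43} = \frac{b_{43}}{V(Y_3)}$$
$$V(Y_4) = \lambda_{44} - b_{41}a_{41} - b_{42}a_{42} - b_{43}a_{43}$$

With $Y_1, \ldots, Y_{i-1}$ known, the steps for the evaluation of $Y_i$ are

$$b_{i1} = \lambda_{i1}, \qquad a_{i1} = \frac{b_{i1}}{V(Y_1)}$$
$$b_{i2} = \lambda_{i2} - a_{21}b_{i1}, \qquad a_{i2} = \frac{b_{i2}}{V(Y_2)}$$
$$b_{i3} = \lambda_{i3} - a_{32}b_{i2} - a_{31}b_{i1}, \qquad a_{i3} = \frac{b_{i3}}{V(Y_3)}$$
$$b_{ij} = \lambda_{ij} - \sum_{r=1}^{j-1} a_{jr}b_{ir}, \qquad a_{ij} = \frac{b_{ij}}{V(Y_j)}, \qquad j \le i - 1$$
$$V(Y_i) = \lambda_{ii} - \sum_{j=1}^{i-1} a_{ij}b_{ij}$$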
The method needs checking at each stage since the constants derived at any stage depend on those previously calculated. Errors may accumulate due to rounding off in earlier calculations, but the accuracy can be maintained by retaining a sufficient number of decimal places at each stage.

It is unnecessary to express $Y$ as a function of $x$ only. This would mean another set of successive operations starting with $Y_1 = x_1$ and substituting for $Y_1$ in $Y_2$, for $Y_1$, $Y_2$ in $Y_3$, and so on. In any problem $Y_1, \ldots, Y_i$ will be successively calculated, and for this the transformation derived above can be directly used. If $Y_i$ has to be directly calculated from the original measurements, then the computational method given in B2 is much simpler.
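The successive steps of B1 translate directly into a double loop. The sketch below follows the recursion for the constants b and a and the variances V(Y); the covariance matrix is invented for illustration, and the function names and the rebuilding check are additions made for this example.

```python
def triangular_transform(lam):
    """Coefficients a[i][j] and variances V(Y_i) by the stepwise scheme.

    lam is the covariance matrix of the original variables (list of rows).
    Y_i = x_i - sum over j < i of a[i][j] * Y_j.
    """
    p = len(lam)
    a = [[0.0] * p for _ in range(p)]
    b = [[0.0] * p for _ in range(p)]   # b[i][j] = cov(x_i, Y_j)
    var = [0.0] * p
    for i in range(p):
        for j in range(i):
            b[i][j] = lam[i][j] - sum(a[j][r] * b[i][r] for r in range(j))
            a[i][j] = b[i][j] / var[j]
        var[i] = lam[i][i] - sum(a[i][j] * b[i][j] for j in range(i))
    return a, var

def rebuild(a, var):
    """Covariance matrix implied by x_i = Y_i + sum a[i][j] Y_j (a check)."""
    p = len(var)
    low = [[1.0 if i == j else a[i][j] for j in range(p)] for i in range(p)]
    return [[sum(low[i][r] * var[r] * low[j][r] for r in range(p))
             for j in range(p)] for i in range(p)]

lam = [[4.0, 2.0, 2.0],
       [2.0, 5.0, 3.0],
       [2.0, 3.0, 6.0]]
a, var = triangular_transform(lam)
print(a[1][0], a[2][0], a[2][1], var)
```

Rebuilding the covariance matrix from the coefficients and variances is one way of carrying out the stage-by-stage check the text recommends.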


B2 An Alternative Computational Scheme 

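An alternative method which directly yields the functions of $x$ is suggested by the following theoretical considerations. Let the dispersion matrix of $x_1, x_2, \ldots, x_p$ be

$$\begin{matrix} \lambda_{11} & \cdots & \lambda_{1p} \\ \cdot & \cdots & \cdot \\ \lambda_{p1} & \cdots & \lambda_{pp} \end{matrix}$$

Consider the extended matrix

$$\begin{matrix} \lambda_{11} & \cdots & \lambda_{1p} & x_1 \\ \cdot & \cdots & \cdot & \cdot \\ \lambda_{p1} & \cdots & \lambda_{pp} & x_p \end{matrix}$$

Taking $\lambda_{11}$ as the first pivotal element, replace the first row by

$$1 \quad \frac{\lambda_{12}}{\lambda_{11}} \quad \cdots \quad \frac{\lambda_{1p}}{\lambda_{11}} \quad \frac{x_1}{\lambda_{11}}$$

Sweeping out the first column and using the first pivotal row, we obtain the reduced matrix

$$\begin{matrix} \lambda_{22}' & \cdots & \lambda_{2p}' & x_2' \\ \cdot & \cdots & \cdot & \cdot \\ \lambda_{p2}' & \cdots & \lambda_{pp}' & x_p' \end{matrix}$$

where

$$\lambda_{ij}' = \lambda_{ij} - \frac{\lambda_{i1}\lambda_{1j}}{\lambda_{11}}, \qquad x_i' = x_i - \frac{\lambda_{i1}}{\lambda_{11}}x_1$$

Now

$$V(x_i') = V(x_i) - \frac{2\lambda_{i1}}{\lambda_{11}}\operatorname{cov}(x_ix_1) + \left(\frac{\lambda_{i1}}{\lambda_{11}}\right)^2 V(x_1) = \lambda_{ii}'$$

Similarly,

$$\operatorname{cov}(x_i'x_j') = \lambda_{ij}'$$

This shows that the reduced matrix at any stage is the dispersion matrix of the new variables on the right-hand side, provided that the first matrix is the dispersion matrix of the original variables. This property has been discussed in 3a.6 in connection with the solution of normal equations and their intrinsic properties. Also

$$\operatorname{cov}(x_1x_i') = \operatorname{cov}(x_1x_i) - \frac{\lambda_{i1}}{\lambda_{11}}V(x_1) = \lambda_{i1} - \lambda_{i1} = 0$$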
so that the new variables are all uncorrelated with the variable of the pivotal row. We now consider the second pivotal row

$$1 \quad \frac{\lambda_{23}'}{\lambda_{22}'} \quad \cdots \quad \frac{\lambda_{2p}'}{\lambda_{22}'} \quad \frac{x_2'}{\lambda_{22}'}$$

and find the further reduced matrix

$$\begin{matrix} \lambda_{33}'' & \cdots & \lambda_{3p}'' & x_3'' \\ \cdot & \cdots & \cdot & \cdot \\ \lambda_{p3}'' & \cdots & \lambda_{pp}'' & x_p'' \end{matrix}$$

We thus obtain the variables

$$x_1,\; x_2',\; x_3'',\; \ldots$$

with variances

$$\lambda_{11},\; \lambda_{22}',\; \lambda_{33}'',\; \ldots$$

They are all mutually uncorrelated as shown above, and further $x_2'$ depends on $x_1$ and $x_2$, and $x_3''$ on $x_1$, $x_2$, and $x_3$ only, and so on. Thus the transformation is of the type considered in B1. We thus obtain a relatively simple scheme for obtaining uncorrelated linear functions of the original variables, provided that the number and variables to be included are fixed in advance. In the earlier method the transformed variables are calculated one after the other so that we are free to choose the variable to be added at any stage and in any order we like. There are a few problems where the decision to add a new character depends on tests to be made with the help of the transformed variates up to that stage. In situations like this only the earlier method is open to us. It is enough to compute the transformation (B1.1) in such a case since successive values of $Y_1, Y_2, \ldots$ will be obtained. There is no need to express $Y$ as functions of $x$ only. In problems where a transformation of a chosen set of correlated variables is required, the alternative method of B2 is better.
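The pivotal condensation of B2 can be sketched by carrying, alongside each row of the dispersion matrix, the coefficients of the original variables in the current transformed variable. The matrix below is invented for illustration and the function name is hypothetical; the pivotal rows are used without first dividing by the pivot, which gives the same reduced matrix.

```python
def pivotal_condensation(lam):
    """Uncorrelated linear functions of x by sweeping out successive columns.

    Returns (coef, variances): coef[i] holds the coefficients of
    x_1, ..., x_p in the ith transformed variable, variances[i] its variance.
    """
    p = len(lam)
    m = [[float(v) for v in row] for row in lam]       # working dispersion matrix
    coef = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    variances = []
    for s in range(p):
        pivot = m[s][s]                                # variance of the sth variable
        variances.append(pivot)
        for i in range(s + 1, p):
            f = m[i][s] / pivot
            for j in range(s, p):
                m[i][j] -= f * m[s][j]                 # reduced dispersion matrix
            for j in range(p):
                coef[i][j] -= f * coef[s][j]           # same operation on the x column
    return coef, variances

lam = [[4.0, 2.0, 2.0],
       [2.0, 5.0, 3.0],
       [2.0, 3.0, 6.0]]
coef, variances = pivotal_condensation(lam)
print(coef, variances)
```

For this matrix the third function comes out as $x_3 - 0.5x_2 - 0.25x_1$, which is what the B1 scheme gives once $Y_1$ and $Y_2$ are substituted in terms of $x$.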

Having obtained $Y_1, Y_2, \ldots, Y_5$ (say) directly as functions of the original variables, if we want to extend the transformation to a sixth variable $x_6$ we write

$$Y_6 = x_6 - a_{65}Y_5 - a_{64}Y_4 - a_{63}Y_3 - a_{62}Y_2 - a_{61}Y_1$$

as in the earlier method. The coefficients are determined from the equations

$$\operatorname{cov}(x_6Y_i) = a_{6i}V(Y_i)$$

Since $Y_i$ is a known function of $x$, it is easy to calculate cov $(x_6Y_i)$, and $V(Y_i)$ is already available.


References 


Birnbaum, Z. W., and D. G. Chapman (1950). On optimum selections from multinormal populations. Ann. Math. Stats., 21, 443.

Brogden, H. E. (1946). An approach to the problem of differential prediction. Psychometrika, 11, 139.

Fairfield Smith, H. (1936). A discriminant function for plant selection. Ann. Eugen. London, 7, 240.

Goodwin, C. N., and G. M. Morant (1940). The human remains of Iron Age and other periods from Maiden Castle, Dorset. Biom., 31, 295.

Hooke, B. G. E. (1926). A third study of the English skull with special reference to the Farringdon Street crania. Biom., 18, 1.

Jeffreys, H. (1948). Theory of probability. Oxford University Press, Oxford.

Martin, E. S. (1936). A study of an Egyptian series of mandibles with special reference to mathematical methods of sexing. Biom., 28, 149.

Mises, R. V. (1945). On the classification of observation data into distinct groups. Ann. Math. Stats., 16, 68.

Morant, G. M. (1926). A first study of craniology of England and Scotland from neolithic to early historic times, with special reference to Anglo-Saxon skulls in London museums. Biom., 18, 56.

Neyman, J., and E. S. Pearson (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biom., 20A, 175 and 263.

Neyman, J., and E. S. Pearson (1933a). On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. Roy. Soc. A, 231, 281.

Neyman, J., and E. S. Pearson (1933b). On the testing of statistical hypotheses in relation to probability a priori. Proc. Cam. Phil. Soc., 29, 492.

Panse, V. G. (1946). An application of the discriminant function for selection in poultry. Jour. Genetics (London), 47, 242.

Pearson, K. P. (1894). Contributions to the mathematical theory of evolution. 1. Dissection of frequency curves. Philos. Trans. Roy. Soc. A, 185, 71.

Pearson, K. P. (1930, 1931). Editor. Tables for statisticians and biometricians, Parts I and II. Cambridge University Press.

Penrose, L. S. (1947). Some notes on discrimination. Ann. Eugen. London, 13, 228.

Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification. J.R.S.S. Suppl., 10, 159.

Rao, C. R. (1950). Statistical inference applied to classificatory problems. Sankhyā, 10, 229.

Rao, C. R., and P. Slater (1949). Multivariate analysis applied to differences between neurotic groups. British Jour. Psychology, Statistics Section, 2, 17.

Wald, A. (1949). Statistical decision functions. Ann. Math. Stats., 20, 165.


CHAPTER 9 


The Concept of Distance 
and the Problem of Group Constellations 


9a Distance between Two Populations 


9a.1 The Need for a Distance Function 

One important object of obtaining biometric measurements is to study 
the possibilities of classifying different groups of individuals in the form 
of a significant pattern. Here we are concerned not with the individual 
variations within a group which played a prominent part in the investi- 
gations of Chapter 8, but with the group characteristics or the statistical 
constants related to the distributions of measurements. The configura- 
tion of several groups or, to be more precise, of the group characteristics 
may admit a description in terms of a few group constellations and their 
interrelationships. The groups within a constellation must necessarily 
be closer, in some sense, to one another than those belonging to different 
constellations. Such a description, based only on measurements, quanti- 
tative or qualitative in character, may be of use in the study of evolution 
of the various groups. 

A word of caution is necessary. Although it is possible to refute 
any statement concerning the relationships of some groups, it cannot 
be asserted that any closeness as indicated by a study of measurements 
alone is due to some common stock from which the groups have evolved. 
Historical and ethnological evidence and also geographical contiguity 
of localities inhabited by various groups have to be considered in inter- 
preting the observed differences. 

The first step in the problem of group constellations is the construction 
of an index by which we can measure the resemblance between two 
groups. With such an index it is possible to speak of a generalized 
distance between two groups and to compare the distances between any 
two pairs of groups. We may, then, be able to say that groups $G_1$ and $G_2$ resemble each other more than $G_2$ and $G_3$ or $G_3$ and $G_4$, and so on. If groups $G_1$ and $G_2$ are close together and $G_3$ is distant from both, we can talk of $G_1$ and $G_2$ as forming a cluster. It may be that all the distances between $G_1$, $G_2$, and $G_3$ are small but the distances of these from the others are large. Then $G_1$, $G_2$, and $G_3$ can be considered to be
a closely associated cluster of groups. By sorting out such clusters it 
may be possible to arrange the various groups in some simplified pattern. 


9a.2 Mathematical Concepts (Discriminatory Topology)

In 8b statistical criteria were developed for specifying an individual 
as a member of one of two groups to which he can possibly belong. 
These criteria depend on the evaluation of the likelihood ratio and on 
the assignment of all individuals providing a ratio higher than a predetermined value $\lambda$ to one group, and the rest to the other. Errors are inevitable in such a procedure, and for any given $\lambda$ the chances of incorrect classification of individuals ... increase in the divergence between the two groups. If the two groups are distinct in the sense that the ranges of measurements are non-overlapping in the two cases, the ... so that no error is committed. ... $\alpha$ so that the zero value of $\alpha$ ... One such function is $(1 - \alpha)$. This satisfies the two fundamental postulates of distance in topological spaces:


(i) The distance between two groups is not less than zero.

(ii) The sum of distances of a group from two other groups is not less than the distance between the two other groups (triangle law of distance).

The first postulate follows since $\alpha \le 1$, in which case the distance $1 - \alpha \ge 0$. To prove the second, we consider three groups, $G_1$, $G_2$, $G_3$. Let $R_1(1, 2)$, $R_2(1, 2)$ be the best divisions of the space corresponding to $G_1$ and $G_2$. Similar definitions hold for $R_1(1, 3)$, $R_3(1, 3)$, $R_2(2, 3)$, $R_3(2, 3)$. Defining

$$\int_{R_i(i,j)} f_i\,dv = 1 - \alpha_{ij}$$

the proposition required to be proved may be stated as follows.

$$(1 - \alpha_{12}) \le (1 - \alpha_{13}) + (1 - \alpha_{23})$$

From definition it follows that

$$\int_{R_3(1,3)} f_1\,dv \le \int_{R_3(1,3)} f_3\,dv$$

so that

$$\int_{R_1(1,3)} f_1\,dv + \int_{R_3(1,3)} f_3\,dv \ge 1$$

Hence

$$\int_{R_1(1,2)} f_1\,dv \le 1 \le \int_{R_1(1,3)} f_1\,dv + \int_{R_3(1,3)} f_3\,dv$$

and

$$\int_{R_2(1,2)} f_2\,dv \le 1 \le \int_{R_2(2,3)} f_2\,dv + \int_{R_3(2,3)} f_3\,dv$$

Adding, $2(1 - \alpha_{12}) \le 2(1 - \alpha_{13}) + 2(1 - \alpha_{23})$, which proves the desired result. The distance function defined above must satisfy some further empirical requirements if it is to be of any value in biological classifications.


(i) The distance must not decrease when additional characters are 
considered. 

(ii) The increase in distance by the addition of some characters to 
a suitably chosen set must be relatively small so that the group con- 
stellations arrived at on the basis of the chosen set are not distorted 
when additional characters are considered. 


The first requirement is reasonable since adding some characters to a basic set must necessarily reduce the errors of classification. In fact, this requirement is satisfied when the distance function is as chosen above. Let $P_1(x_1, \ldots, x_p)$ and $P_2(x_1, \ldots, x_p)$ denote the probability densities of two groups with $R_1(p)$ and $R_2(p)$ as the best divisions of the $p$-space. When an additional character is considered, the probability densities can be written $P_1(x_1, \ldots, x_{p+1})$ and $P_2(x_1, \ldots, x_{p+1})$ with $R_1(p + 1)$ and $R_2(p + 1)$ as the best division of the $(p + 1)$-space. If $\Omega$ denotes the region obtained by considering $R_1(p)$ and the complete range for $x_{p+1}$, then by definition it follows that

$$\int_{R_1(p+1)} P_2(x_1, \ldots, x_{p+1})\,dv' \le \int_{\Omega} P_2(x_1, \ldots, x_{p+1})\,dv' = \int_{R_1(p)} P_2(x_1, \ldots, x_p)\,dv$$

so that if $\alpha$ and $\alpha'$ represent the proportions of overlapping individuals in the two cases it follows that $\alpha' \le \alpha$, which proves the result.
The second requirement has been introduced merely as a practical 


number of characters used in order to arrive at stable judgments. This 
should be empirically verified in any situation. 
9a.3 Mahalanobis’ Generalized Distance 


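Consider two multivariate normal populations with a common dispersion
matrix $(\lambda_{ij})$ and having mean values $\mu_{11}, \dots, \mu_{p1}$ and $\mu_{12}, \dots, \mu_{p2}$,
so that the probability densities $f_1(x)$ and $f_2(x)$ are

$$f_1 = \text{const.} \exp\{-\tfrac{1}{2} \sum\sum \lambda^{ij} (x_i - \mu_{i1})(x_j - \mu_{j1})\}$$

$$f_2 = \text{const.} \exp\{-\tfrac{1}{2} \sum\sum \lambda^{ij} (x_i - \mu_{i2})(x_j - \mu_{j2})\}$$

where $(\lambda^{ij})$ is reciprocal to $(\lambda_{ij})$.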


The surfaces of constant likelihood ratios are defined by

$$L(x) = \sum_j \Big( \sum_i \lambda^{ij} d_i \Big) x_j = \text{const.},$$

where $d_i = \mu_{i1} - \mu_{i2}$ $(i = 1, \dots, p)$. The common value of the least
proportion of wrong classifications is

$$\alpha = \int_{L(x) > a} f_2\,dv = \int_{L(x) < a} f_1\,dv$$


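This can be determined by considering only the distribution of $L(x)$,
which, being a linear function of $x$, is normal. The mean of $L(x)$ in the
first group is

$$L(\mu_1) = \sum_j \Big( \sum_i \lambda^{ij} d_i \Big) \mu_{j1}$$

and in the second group

$$L(\mu_2) = \sum_j \Big( \sum_i \lambda^{ij} d_i \Big) \mu_{j2}$$

The variance of $L(x)$ in either group is

$$V\{L(x)\} = D^2 = \sum\sum \lambda^{ij} d_i d_j = L(\mu_1) - L(\mu_2)$$

so that

$$\alpha = \int_{L(x) < a} \frac{1}{D\sqrt{2\pi}} \exp\Big\{ -\frac{[L(x) - L(\mu_1)]^2}{2D^2} \Big\}\,dL(x)$$

By symmetry it follows that

$$a = \frac{L(\mu_1) + L(\mu_2)}{2}$$

On making the transformation $y = [L(x) - L(\mu_1)]/D$, the above integral
reduces to

$$\alpha = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-D/2} e^{-y^2/2}\,dy$$

and hence

$$1 - \alpha = \frac{1}{\sqrt{2\pi}} \int_{-D/2}^{\infty} e^{-y^2/2}\,dy$$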

which shows that (1 — a) is an increasing function of D. The measure 
(1 — a) may then be conveniently replaced by the functionally depend- 
ent quantity D,* which was first introduced by P. C. Mahalanobis. 
The above analysis supplies a logical derivation of this tool suggested 
on intuitive considerations. It also shows that Mahalanobis’ generalized 
distance function is applicable only to groups in which the measurements 
are normally distributed. 
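The relation between D and the proportion of wrong classifications can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it handles two characters, the function names `mahalanobis_d2` and `overlap_proportion` are hypothetical, and the means and dispersion matrix are invented values, not data from the text. It evaluates $D^2 = \sum\sum \lambda^{ij} d_i d_j$ and then $\alpha$ as the normal integral from $-\infty$ to $-D/2$.

```python
import math

def mahalanobis_d2(mean1, mean2, dispersion):
    """D^2 = sum_ij lambda^{ij} d_i d_j for two characters (2 x 2 dispersion)."""
    (a, b), (c, d) = dispersion
    det = a * d - b * c
    inverse = [[d / det, -b / det], [-c / det, a / det]]  # the matrix (lambda^{ij})
    diff = [m1 - m2 for m1, m2 in zip(mean1, mean2)]
    return sum(inverse[i][j] * diff[i] * diff[j] for i in range(2) for j in range(2))

def overlap_proportion(d2):
    """alpha = normal integral up to -D/2, the least proportion of wrong classifications."""
    return 0.5 * math.erfc(math.sqrt(d2) / (2.0 * math.sqrt(2.0)))

# Hypothetical means and dispersion matrix, invented for illustration only.
d2 = mahalanobis_d2([1.0, 0.5], [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])
alpha = overlap_proportion(d2)
print(d2, alpha)
```

Larger values of D give smaller values of alpha, which is the sense in which (1 - alpha) may be replaced by D as a measure of distance.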


9a.4 Karl Pearson’s Coefficient of Racial Likeness 

In 1921, in a paper by Miss M. L. Tildesley, Karl Pearson proposed a
measure of racial likeness (C.R.L.) which has been used since then by
anthropologists of the biometric school for purposes of classifying skeletal
remains (Morant, 1923). If $n_{1i}$ and $n_{2i}$ denote the number of observations
on which the means $m_{i1}$ and $m_{i2}$ of the ith character for the first
and second groups are based, $s_i$ is the standard deviation of the ith
character, and p is the number of characters used, then C.R.L. is defined
by

$$\text{C.R.L.} = \frac{1}{p} \sum_{i=1}^{p} \frac{n_{1i} n_{2i}}{n_{1i} + n_{2i}} \Big( \frac{m_{i1} - m_{i2}}{s_i} \Big)^2 - 1$$

* D is the ordinary Euclidean distance in a space defined by a set of oblique axes.
Therefore it also satisfies the triangular law of distance.

This is meant to be an estimate of a measure of distance between two
populations. Since this estimate depends to a large extent on the sample
sizes, a reduction factor is employed for comparing the C.R.L. values
arising out of two pairs. The reduced coefficient of racial likeness
(R.C.R.L.) is

$$\text{R.C.R.L.} = \frac{\bar n_1 + \bar n_2}{\bar n_1 \bar n_2} \Big\{ \frac{1}{p} \sum_{i=1}^{p} \frac{n_{1i} n_{2i}}{n_{1i} + n_{2i}} \Big( \frac{m_{i1} - m_{i2}}{s_i} \Big)^2 - 1 \Big\}$$

where

$$\bar n_1 = \frac{\sum n_{1i}}{p}, \qquad \bar n_2 = \frac{\sum n_{2i}}{p}$$

It may be seen that

$$E(\text{R.C.R.L.}) = \frac{1}{p}\,\frac{\bar n_1 + \bar n_2}{\bar n_1 \bar n_2} \sum_{i=1}^{p} \frac{n_{1i} n_{2i}}{n_{1i} + n_{2i}} \Big( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \Big)^2$$

where $\mu_{i1}$ and $\mu_{i2}$ are the population mean values and it is assumed that
$s_i$ is not subject to variations. This is not independent of the sample
sizes unless $n_{1i} = \bar n_1$ and $n_{2i} = \bar n_2$ for all i.

With skeletal material there is a good deal of variation in the number
of observations available for various characters ... group characteristics, the
... according to any criteria; the ... observations available. This, ...
in the estimate of $D^2$ is not very serious ... necessary, as shown later, by
subtracting some value which depends solely on the sample sizes and whose
value is negligible (tending to zero) in large samples. Comparability is
retained because the weights attached to the various characters are not
functions of sample sizes.




C.R.L. differs from $D^2$ in another important aspect. In C.R.L. all
the characters are treated as independent. All biological experience 
points to the fact that some amount of correlation exists between any 
two characters, and the effect of this is to make C.R.L. increase 
rapidly with an increase in the number of characters. It is difficult 
to say from the changes in C.R.L. whether some newly added characters 
supply additional information for purposes of discrimination. It was 
seen in the examples of Chapter 8 that, if differences in some characters 
have already been considered, the absolute differences in some other 
characters do not throw any light on discrimination between the groups. 
There is no provision in the construction of C.R.L. to distinguish which
characters are more useful for discrimination. For instance, when a 
character highly correlated with a set of characters is used in addition 
to those in the set, the C.R.L. may be considerably altered although the 
change may be inappreciable when the high correlation is taken into 
account. The change in $D^2$ may not be appreciable if such superfluous
characters are used. If any character alters $D^2$ considerably, it may be
taken to be of additional value in discrimination. In view of the fact
that the $D^2$ statistic satisfies some logical requirements, it may be used
in preference to the C.R.L. 


9b An Illustrative Example 


9b.1 Calculation of $D^2$

The mean values of 9 characters for 12 castes and tribes of the United
Provinces (Mahalanobis, Majumdar, and Rao, 1949) are given in Table
9b.1α, and the pooled intragroup correlations and standard deviations
in Table 9b.1β. The first step in the analysis is the evaluation of all
possible $D^2$ values. The formula

$$D^2 = \sum\sum \lambda^{ij} d_i d_j$$

for the computation of $D^2$ is not useful since it requires the inversion
of a ninth-order determinant and then the evaluation of 9(9 + 1)/2
terms whose sum is $D^2$. In an earlier example, the classification of the
Highdown skull, it was found convenient to work with a set of uncorrelated
characters constructed from the original measurements. The
$D^2$ with such transformed variables reduces to the evaluation of a simple
sum of squares. Two simple methods of obtaining this transformation
are given in Appendix B of Chapter 8, and one method is actually
illustrated in the problem of the Highdown skull. The other method
is followed in the present example. If the original characters expressed
in standard deviation units are represented by lower-case letters, then


TABLE 9b.1α. Mean Values by Groups and Characters

[Table printed sideways in the original. For each of the 12 groups, Brahmin (Basti, B1), Brahmin (Other, B2), Chattri (Ch), Muslim (M), Bhatu (C1), Habru (C2), Bhil (Bh), Dom (D), Ahir (A1), Kurmi (A2), Other Artisan (A3), and Kahar (A4), it gives the sample size n and the mean of Head Length, Head Breadth, Bizygomatic Breadth, Nasal Height, Nasal Breadth, Nasal Depth, Stature, Sitting Height, and Frontal Breadth.]


TABLE 9b.1β. Pooled Estimates of Correlations and Standard Deviations

          HB       B-B      NH       NB       ND       St       SH       FB
HL      0.1982   0.2792   0.1758   0.1537   0.1930   0.2698   0.2651   0.2270
HB               0.5407   0.1735     ...      ...    0.1927   0.2069   0.4461
B-B                       0.1852     ...      ...      ...      ...    0.4930

[The remaining correlations are not legible. Standard deviations as legible, two entries missing: 6.60, 4.50, 4.58, 3.50, 2.57, 3.20, 3.92.]

the transformed variables $y_1, y_2, \dots, y_9$, which are all uncorrelated and
which have standard deviation unity, are as given below.

$$\begin{aligned}
y_1 &= Y_1 = hl \\
0.980162\,y_2 &= Y_2 = hb - 0.198200Y_1 \\
0.822702\,y_3 &= Y_3 = b\text{-}b - 0.505209Y_2 - 0.279200Y_1 \\
0.970893\,y_4 &= Y_4 = nh - 0.097610Y_3 - 0.144326Y_2 - 0.175800Y_1 \\
0.953943\,y_5 &= Y_5 = nb + 0.023239Y_4 - 0.246667Y_3 - 0.107261Y_2 - 0.153700Y_1 \\
0.944814\,y_6 &= Y_6 = nd - 0.069643Y_5 - 0.258066Y_4 - 0.094404Y_3 - 0.104439Y_2 - 0.193000Y_1 \\
0.924222\,y_7 &= Y_7 = st - 0.084080Y_6 - 0.045612Y_5 - 0.122926Y_4 - 0.211917Y_3 - 0.144918Y_2 - 0.269800Y_1 \\
\ldots\,y_8 &= Y_8 = sh - 0.507938Y_7 - 0.115274Y_6 - 0.018904Y_5 - 0.141858Y_4 - 0.217927Y_3 - 0.160669Y_2 - 0.265100Y_1 \\
0.826753\,y_9 &= Y_9 = fb - 0.162016Y_8 - 0.054612Y_7 - 0.027845Y_6 - 0.044106Y_5 + 0.006423Y_4 - 0.335334Y_3 - 0.417533Y_2 - 0.227000Y_1
\end{aligned}$$
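The transformation above is a step-by-step orthogonalization: each $Y_j$ is the standardized character with its regression on the earlier $Y_k$ removed, and $y_j$ is $Y_j$ divided by its standard deviation. The following sketch is an illustration only, with the hypothetical function name `sequential_orthogonalization`. It uses the six correlations among HL, HB, B-B, and NH that are legible in Table 9b.1β, and under that assumption it reproduces the coefficients and divisors of the first four equations.

```python
import math

def sequential_orthogonalization(r):
    """Return (coef, scale) such that Y_j = x_j - sum_{k<j} coef[j][k] * Y_k
    and y_j = Y_j / scale[j], for standardized characters with correlations r."""
    p = len(r)
    coef = [[0.0] * p for _ in range(p)]
    cov = [[0.0] * p for _ in range(p)]  # cov[j][k] = covariance of x_j with Y_k
    var = [0.0] * p                      # var[k] = variance of Y_k
    for j in range(p):
        for k in range(j):
            cov[j][k] = r[j][k] - sum(coef[k][m] * cov[j][m] for m in range(k))
            coef[j][k] = cov[j][k] / var[k]
        var[j] = r[j][j] - sum(coef[j][k] * cov[j][k] for k in range(j))
    return coef, [math.sqrt(v) for v in var]

# Correlations among HL, HB, B-B, NH as legible in Table 9b.1beta.
R4 = [
    [1.0, 0.1982, 0.2792, 0.1758],
    [0.1982, 1.0, 0.5407, 0.1735],
    [0.2792, 0.5407, 1.0, 0.1852],
    [0.1758, 0.1735, 0.1852, 1.0],
]
coef, scale = sequential_orthogonalization(R4)
print(coef[2][1], coef[3][2], scale)
```

With these inputs the coefficient of $Y_2$ in $Y_3$ comes out close to 0.505209 and the divisors close to 0.980162, 0.822702, and 0.970893.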


The normalized mean values of the characters are given in Table
9b.1γ, and the mean values of the transformed characters $y_1, \dots, y_9$ in
Table 9b.1δ. First the values of $Y_1, Y_2, \dots$ are obtained by substituting
the values from Table 9b.1γ, and then by division by the corresponding
standard deviations $y_1, \dots, y_9$ are derived. The $D^2$ corresponding to
any two groups is the sum of squares of the differences in the values
of $y_1, \dots, y_9$. The values of $D^2$ are given in Table 9b.1ε, where
corresponding to each group the other groups are arranged in increasing order
of $D^2$.

TABLE 9b.1γ. Normalized Mean Values of Characters

Group                    hl      hb      b-b     nh      nb      nd      st      sh      fb
Brahmin (Basti, B1)     0.534   0.423   0.310   0.293   0.070   0.566   0.238   0.815   0.508
Brahmin (Other, B2)     0.448   0.338   0.161   0.053  -0.093   0.137   0.336   0.759   0.436
Chattri (Ch)            0.634   0.165  -0.053   0.715  -0.284   0.131   0.033  -0.491   0.314
Muslim (M)              0.361  -0.128  -0.092   0.333  -0.004  -0.006  -0.120  -0.622   0.135
Bhatu (C1)             -0.348   0.134   0.351   0.527  -0.280   0.337   0.041   0.209  -0.870
Habru (C2)             -0.220  -0.128  -0.171   0.024  -0.214  -0.177   0.308   0.540  -0.655
Bhil (Bh)              -0.988  -0.079  -0.166  -0.462   0.436  -0.257  -0.039  -0.401   0.156
Dom (D)                -0.302  -0.101   0.152   0.035   0.677   0.474   0.590   0.115   0.360
Ahir (A1)              -0.143   0.032  -0.053  -0.353  -0.300  -0.120  -0.309   0.165   0.003
Kurmi (A2)              0.071  -0.026  -0.027  -0.285  -0.062  -0.269  -0.312  -0.129  -0.033
Other Artisan (A3)     -0.107  -0.253  -0.140  -0.427  -0.039  -0.440  -0.314  -0.229  -0.079
Kahar (A4)              0.066  -0.377  -0.271  -0.456   0.093  -0.377  -0.455  -0.735  -0.273

TABLE 9b.1δ. Mean Values of Transformed Characters

Group                    y1      y2      y3      y4      y5      y6      y7      y8      y9
Brahmin (Basti, B1)     0.534   0.323   0.001   0.158  -0.066   0.440  -0.003   0.705   0.189
Brahmin (Other, B2)     0.448   0.254  -0.109  -0.054  -0.194    ...    0.223   0.656   0.206
Chattri (Ch)            0.634   0.040  -0.303   0.641  -0.350    ...   -0.157  -0.773   0.441
Muslim (M)              0.361  -0.204  -0.111   0.316  -0.023    ...   -0.213  -0.756   0.339
Bhatu (C1)             -0.348   0.207   0.420   0.541  -0.322    ...   -0.040   0.150  -1.207
Habru (C2)             -0.220  -0.086  -0.083   0.084  -0.152    ...    0.435   0.545  -0.760
Bhil (Bh)              -0.988   0.119   0.062  -0.319   0.623    ...    0.236  -0.301   0.381
Dom (D)                -0.302  -0.040   0.313   0.071   0.711    ...    0.592  -0.260   0.388
Ahir (A1)              -0.143   0.061  -0.053  -0.342  -0.289    ...   -0.235   0.462  -0.015
Kurmi (A2)              0.071  -0.041  -0.032  -0.298  -0.075    ...   -0.286   0.081  -0.016
Other Artisan (A3)     -0.107  -0.237   0.009  -0.387  -0.004    ...    0.196   0.020   0.064
Kahar (A4)              0.066  -0.399  -0.112  -0.414   0.142  -0.258  -0.360  -0.530  -0.012
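Because the transformed characters are uncorrelated with unit standard deviation, the computation of $D^2$ is a plain sum of squared differences. As a minimal sketch (the function name `d2_uncorrelated` is hypothetical), the two rows of the table of transformed means that carry all nine values, Brahmin (Basti, B1) and Kahar (A4), can be compared directly.

```python
# Mean values of the transformed characters y1, ..., y9 for two groups.
basti_brahmin = [0.534, 0.323, 0.001, 0.158, -0.066, 0.440, -0.003, 0.705, 0.189]
kahar = [0.066, -0.399, -0.112, -0.414, 0.142, -0.258, -0.360, -0.530, -0.012]

def d2_uncorrelated(mean_a, mean_b):
    """D^2 as the sum of squares of differences of uncorrelated, unit-variance characters."""
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))

d2 = d2_uncorrelated(basti_brahmin, kahar)
print(round(d2, 2))
```

The result agrees, to two decimals, with the entry 3.30 for the pair B1 and A4 in the table of $D^2$ values.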


TABLE 9b.1ε. Values of D² (Based on 9 Characters) Arranged in Increasing Order of Magnitude

Brahmin     | Brahmin     | Bhatu   | Habru   | Dom     | Bhil
(Basti, B1) | (Other, B2) | (C1)    | (C2)    | (D)     | (Bh)
B2 0.27     | B1 0.27     | C2 1.32 | A1 1.26 | Bh 1.15 | D  1.15
A1 1.17     | A1 0.78     | A1 2.68 | C1 1.32 | C2 2.11 | A3 1.75
A2 1.48     | A2 1.03     | A2 2.98 | A2 1.53 | A3 2.31 | A2 2.23
A3 2.13     | A3 1.47     | A3 3.35 | B2 1.63 | A2 2.41 | A4 2.24
C2 2.23     | C2 1.63     | B1 3.48 | A3 1.67 | M  2.47 | C2 2.43
M  2.86     | M  2.62     | B2 3.61 | D  2.11 | A4 2.66 | A1 2.53
D  2.86     | A4 2.72     | A4 4.20 | B1 2.23 | B2 2.81 | M  3.16
Ch 3.05     | D  2.81     | M  4.46 | Bh 2.43 | B1 2.86 | B2 3.82
A4 3.30     | Ch 2.87     | D  4.52 | A4 2.87 | A1 2.91 | B1 4.45
C1 3.48     | C1 3.61     | Bh 5.08 | M  3.74 | Ch 3.84 | Ch 5.02
Bh 4.45     | Bh 3.82     | Ch 5.25 | Ch 4.68 | C1 4.52 | C1 5.08

Chattri | Muslim  | Ahir    | Kurmi   | Other Artisan | Kahar
(Ch)    | (M)     | (A1)    | (A2)    | (A3)          | (A4)
M  0.40 | Ch 0.40 | A2 0.30 | A3 0.12 | A2 0.12       | A3 0.43
A2 2.12 | A4 0.90 | A3 0.49 | A1 0.30 | A4 0.43       | A2 0.58
A4 2.24 | A2 1.34 | B2 0.78 | A4 0.58 | A1 0.49       | M  0.90
A3 2.72 | A3 1.45 | B1 1.17 | B2 1.03 | M  1.45       | A1 1.52
B2 2.87 | A1 2.45 | C2 1.26 | M  1.34 | B2 1.47       | Bh 2.24
B1 3.05 | D  2.47 | A4 1.52 | B1 1.48 | C2 1.67       | Ch 2.24
A1 3.38 | B2 2.62 | M  2.45 | C2 1.53 | Bh 1.75       | D  2.66
D  3.84 | B1 2.86 | Bh 2.53 | Ch 2.12 | B1 2.13       | B2 2.72
C2 4.68 | Bh 3.16 | C1 2.68 | Bh 2.23 | D  2.31       | C2 2.87
Bh 5.02 | C2 3.74 | D  2.91 | D  2.41 | Ch 2.72       | B1 3.30
C1 5.25 | C1 4.46 | Ch 3.38 | C1 2.98 | C1 3.35       | C1 4.20


9b.2 The Determination of Group Constellations 

A scrutiny of Table 9b.1ε reveals that there is a pattern exhibited by
the 12 groups. The Brahmins B1 and B2 cluster together in almost each
column and may be treated as a single unit, and also the four Artisans
A1, A2, A3, and A4, which appear to be linearly arranged in the order
A1, A2, A3, and A4 with A4 being most distant from A1. Bhatu (C1)
and Habru (C2) go together, with C2 being nearer to the Brahmin and
Artisan clusters and C1 far removed from them. Bhil (Bh) and Dom (D)
are closer, and so also are Muslim (M) and Chattri (Ch).


FIGURE 1. Clusters and their interrelationships.


The average $D^2$ within and between clusters found above is given in
Table 9b.2α. The configuration of the clusters and their mutual relationships
is approximately indicated in Figure 1, where the square root
of average $D^2$ (of Table 9b.2α) represents the distance.

TABLE 9b.2α. Intra- and Inter-Cluster Average D²

Cluster   Groups            A      B      C      Ch     D
A         A1, A2, A3, A4   0.57   1.76   2.57   2.07   2.38
B         B1, B2           1.76   0.27   2.74   2.85   3.48
C         C1, C2           2.57   2.74   1.32   4.53   3.58
Ch        Ch, M            2.07   2.85   4.53   0.40   3.62
D         D, Bh            2.38   3.48   3.58   3.62   1.15
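The entries of the cluster table are simple averages of the pairwise $D^2$ values. The sketch below, with hypothetical helper names `average_within` and `average_between`, recomputes three of them from the $D^2$ values among the Artisan and Brahmin groups. It is an illustration of the averaging, not a full recomputation of the table.

```python
from itertools import combinations

# Pairwise D^2 values among the Artisan (A1-A4) and Brahmin (B1, B2) groups.
D2 = {
    ("A1", "A2"): 0.30, ("A1", "A3"): 0.49, ("A1", "A4"): 1.52,
    ("A2", "A3"): 0.12, ("A2", "A4"): 0.58, ("A3", "A4"): 0.43,
    ("B1", "B2"): 0.27,
    ("B1", "A1"): 1.17, ("B1", "A2"): 1.48, ("B1", "A3"): 2.13, ("B1", "A4"): 3.30,
    ("B2", "A1"): 0.78, ("B2", "A2"): 1.03, ("B2", "A3"): 1.47, ("B2", "A4"): 2.72,
}

def d2(g, h):
    """Look up D^2 for an unordered pair of groups."""
    return D2[(g, h)] if (g, h) in D2 else D2[(h, g)]

def average_within(cluster):
    pairs = list(combinations(cluster, 2))
    return sum(d2(g, h) for g, h in pairs) / len(pairs)

def average_between(first, second):
    return sum(d2(g, h) for g in first for h in second) / (len(first) * len(second))

artisans, brahmins = ["A1", "A2", "A3", "A4"], ["B1", "B2"]
within_a = average_within(artisans)
within_b = average_within(brahmins)
between_ab = average_between(artisans, brahmins)
print(round(within_a, 2), round(within_b, 2), round(between_ab, 2))
```

These reproduce the tabulated values 0.57, 0.27, and 1.76.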


No formal rules can be laid down for finding the clusters because a
cluster is not a well-defined term. The only criterion appears to be
that any two groups belonging to the same cluster should at least on
the average show a smaller $D^2$ than those belonging to two different
clusters. A simple device suggested by K. D. Tocher is to start with
two closely associated groups and find a third group which has the
smallest average $D^2$ from the first two. Similarly, the fourth is chosen
to have the smallest average $D^2$ from the first three, and so on. If at
any stage the average $D^2$ of a group from those already listed appears
to be high, then this group does not fit in with the former groups and is
therefore taken to be outside the former cluster. The groups of the
first cluster are then omitted, and the rest are treated similarly.

It is also useful to calculate the change in average $D^2$ within a cluster
due to the inclusion of an additional group. If the change is appreciable,
then the newly added group has to be considered as outside the cluster.
The calculations arising from the $D^2$ table (9b.1ε) are given in Table
9b.2β.


TABLE 9b.2β. Computational Scheme for Finding Clusters

Groups included | ΣD²   | No. of terms (n) | Average increase in D² | Average D² (ΣD²/n) | Cluster
A2, A3          |  0.12 |  1               | ....                   | 0.12               |
A1              |  0.91 |  3               | 0.39                   | 0.30               |
A4              |  3.44 |  6               | 0.84                   | 0.57               | A1, A2, A3, A4
M               | 13.02 | 10               | 2.39                   | 1.30               |
B1, B2          |  0.27 |  1               | ....                   | 0.27               |
C2              |  4.13 |  3               | 1.93                   | 1.38               | B1, B2
M, Ch           |  0.40 |  1               | ....                   | 0.40               |
D               |  6.71 |  3               | 3.15                   | 2.24               | M, Ch
Bh, D           |  1.15 |  1               | ....                   | 1.15               |
C2              |  5.69 |  3               | 2.27                   | 1.90               | Bh, D
C1, C2          |  1.32 |  1               | ....                   | 1.32               | C1, C2


There is a sharp increase in the average increase in $D^2$ when M is
added to A1, A2, A3, A4. This indicates that A1, A2, A3, A4 form a
cluster; therefore these are omitted and the rest are considered. Thus
the other clusters are determined.
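The bookkeeping of this scheme can be sketched as follows. The helper `add_group` is a hypothetical name; it adds one candidate group to a cluster and returns the running sum of $D^2$, the average increase per group already in the cluster, and the average $D^2$ over all pairs. The example follows the steps A2, A3, then A1, then A4, and the step M, Ch, then D, using the corresponding $D^2$ values.

```python
# Pairwise D^2 values needed for the steps illustrated below.
D2 = {
    ("A1", "A2"): 0.30, ("A1", "A3"): 0.49, ("A1", "A4"): 1.52,
    ("A2", "A3"): 0.12, ("A2", "A4"): 0.58, ("A3", "A4"): 0.43,
    ("M", "Ch"): 0.40, ("D", "M"): 2.47, ("D", "Ch"): 3.84,
}

def d2(g, h):
    """Look up D^2 for an unordered pair of groups."""
    return D2[(g, h)] if (g, h) in D2 else D2[(h, g)]

def add_group(cluster, total, candidate):
    """Add a candidate; return (cluster, sum of D^2, average increase, average D^2)."""
    increase = sum(d2(candidate, g) for g in cluster)
    enlarged = cluster + [candidate]
    n_terms = len(enlarged) * (len(enlarged) - 1) // 2
    new_total = total + increase
    return enlarged, new_total, increase / len(cluster), new_total / n_terms

cluster, total = ["A2", "A3"], d2("A2", "A3")
cluster, total, rise_a1, mean_a1 = add_group(cluster, total, "A1")
cluster, total, rise_a4, mean_a4 = add_group(cluster, total, "A4")
_, total_d, rise_d, mean_d = add_group(["M", "Ch"], d2("M", "Ch"), "D")
print(total, rise_a1, rise_a4, total_d, rise_d, mean_d)
```

The average increase stays below 1 while the Artisan groups are added, but jumps to about 3.15 when D is added to M and Ch, which is the kind of sharp rise that marks a group as lying outside the cluster.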




For comparison of $D^2$ values it must be ascertained that they estimate
the corresponding population distances. This is one reason why
C.R.L. was found to be not useful. The bias in $D^2$ when the dispersion
is estimated on a large number of degrees of freedom is given by

$$\sum\sum \lambda^{ij} \lambda_{ij}\,\frac{n_{ij1} + n_{ij2}}{n_{ij1} \times n_{ij2}}$$

where $n_{ij1}$ is the number of samples supplying observations on both the
ith and the jth character in the first group, and similarly $n_{ij2}$ for the
second group. The value $n_{ii1} = n_{i1}$, the number of observations on the
ith character. This quantity, which depends only on the sample sizes
and variances and covariances, can be calculated and subtracted from
$D^2$. If $n_{i1} = n_1$ for all i and $n_{j2} = n_2$ for all j, then the bias is simply

$$p\,\frac{n_1 + n_2}{n_1 n_2}$$

where p is the number of characters. If $n_1$ and $n_2$ are individually
large, the correction is trivial and need not be carried out. Also, if the
sample numbers are such that the quantities

$$\frac{n_1 + n_2}{n_1 n_2}$$

are of the same order for all pairs of groups under consideration, then
no correction is needed because the $D^2$ values become comparable. In
the illustration chosen above, the value of $p(n_1 + n_2)/n_1 n_2$ is very
nearly the same and also small for all pairs of groups, and hence no
correction was carried out.
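For the equal-sample-size case the correction is a one-line computation. The sketch below uses the hypothetical function name `d2_bias` and invented sample sizes (they are not the survey's actual numbers) to show how the quantity $p(n_1 + n_2)/n_1 n_2$ behaves for p = 9 characters.

```python
def d2_bias(p, n1, n2):
    """Bias of D^2 when every character is observed on n1 and n2 individuals."""
    return p * (n1 + n2) / (n1 * n2)

# Hypothetical pairs of sample sizes, invented for illustration only.
sample_sizes = [(90, 100), (120, 150), (150, 180)]
biases = [d2_bias(9, n1, n2) for n1, n2 in sample_sizes]
print([round(b, 3) for b in biases])

# A corrected distance is the observed D^2 minus the bias for that pair.
corrected = 2.86 - d2_bias(9, 90, 100)
```

The bias shrinks as both samples grow, so for large samples of similar size the uncorrected values remain comparable.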


9c The Use of Canonical Variates in Deriving Group Constellations 


9c.1 Graphical Methods of Representing the Groups 

The method of finding the group constellations, as described in 9b,
becomes much simpler if the groups are characterized by two or three
measurements. In the case of two characters $x_1$ and $x_2$, we can represent
the mean values, expressed in standard deviation units, of the k
populations under consideration on a two-dimensional chart with axes
inclined at an angle $\cos^{-1} r$, where r is the correlation between $x_1$ and $x_2$.
In such a chart, the distance between two points is equivalent, apart
from a constant multiplier, to Mahalanobis' D between the two populations
represented by the two points. This is valuable as it facilitates
the study of group constellations and also serves as a pictorial
representation of the configuration of various groups.

For three characters we can construct a three-dimensional model
representing the characters along three mutually inclined axes. In order
that the distance between two points might be equal to the D, apart
from a constant multiplier, between the two populations represented
by them, the angle between the axes corresponding to $x_1$ and $x_2$ should
be chosen as $\cos^{-1} r_{12.3}$ and the scale of $x_1, x_2, x_3$ be properly adjusted.

If the object of representation is only to measure D by the actual
distance in space, we can transform the characters to independent variables,
in which case they can be represented along three mutually
orthogonal axes. This method of representation fails when we are
dealing with more than three characters. In such cases it might be
useful to examine whether the configuration of the mean values with
respect to p > 3 characters can be preserved, as far as possible, by
representing the groups with respect to two or three suitably chosen
functions of the p characters. A convenient measure for examining the
adequacy of such a simpler representation is given by the ratio of the
sum of squares of all possible k(k - 1)/2 distances arising out of the
k populations in the simpler representation to the sum of squares of the
p-dimensional representation. The former sum of squares is not greater
than the latter, and the two representations are identical when the
ratio is unity. When this ratio is close to unity, the simpler model
might be considered as a fair representation of the groups in the total
character space.


9c.2 The Problem of Maximal Average $D^2$

The general problem to be solved is: what are the best t (< p) linear
combinations of the p variates which make the sum of all possible $D^2$
values arising out of a number of populations, as calculated with these
t variates, a maximum?

Let

$$\begin{matrix} m_{11} & \cdots & m_{p1} \\ m_{12} & \cdots & m_{p2} \\ \cdot & \cdots & \cdot \\ m_{1k} & \cdots & m_{pk} \end{matrix}$$

represent the mean values of p characters for the first, second, ..., kth
populations, and $\Lambda$ the common dispersion matrix. Suppose that t = 1
and the required linear function is $l_1 x_1 + l_2 x_2 + \cdots + l_p x_p$ with its
variance $l \Lambda l'$, where l is the vector $(l_1, l_2, \dots, l_p)$. The $D^2$ with respect
to $l_1 x_1 + \cdots + l_p x_p$ between the ith and jth populations is

$$[l_1(m_{1i} - m_{1j}) + \cdots + l_p(m_{pi} - m_{pj})]^2 / l \Lambda l'$$

The sum of all possible $D^2$ values is proportional to

$$\sum\sum l_i l_j b_{ij} / l \Lambda l'$$

where

$$b_{ij} = \sum_r (m_{ir} - \bar m_i)(m_{jr} - \bar m_j)$$

and $\bar m_j = (\sum_r m_{jr})/k$. This is nothing but between populations variance
with respect to the chosen linear compound.

To find the best linear function, $\sum\sum l_i l_j b_{ij}$ is maximized, subject to
the condition $\sum\sum l_i l_j \lambda_{ij} = 1$. Introducing the Lagrangian multiplier $\lambda$,
the vector l is obtained as the solution of

$$l(B - \lambda\Lambda) = 0$$

where $B = (b_{ij})$, $\Lambda = (\lambda_{ij})$, and $\lambda$ is the greatest root of the equation
(see 1c.4)

$$|B - \lambda\Lambda| = 0$$

Suppose that we want the best two functions $lx'$ and $mx'$ where l and
m can be chosen to satisfy the conditions $l\Lambda l' = 1 = m\Lambda m'$ and $l\Lambda m' = 0$.
Introducing Lagrangian multipliers $c_1, c_2, c_3$, the expression to
be differentiated is

$$lBl' + mBm' - c_1 l\Lambda l' - 2c_3 l\Lambda m' - c_2 m\Lambda m'$$

The equations are

$$lB - c_1 l\Lambda - c_3 m\Lambda = 0$$

$$mB - c_2 m\Lambda - c_3 l\Lambda = 0$$

The value of $c_3$ can be shown to be zero by multiplying the first equation
by $m'$ and the second by $l'$ and adding. This shows that $c_1$ and $c_2$
are roots of the equation

$$|B - c\Lambda| = 0$$

Since the maximized value of $lBl' + mBm'$ is $c_1 + c_2$, it follows that
$c_1$ and $c_2$ correspond to the first two largest roots. Also the vectors
corresponding to any two roots satisfy the condition $l\Lambda m' = 0$ (see
1c.3), so that $lx'$ and $mx'$ are uncorrelated. The best two linear functions
are thus the first two canonical vectors associated with the matrices
B and $\Lambda$ as discussed in 1c.3.

Similarly, the first t canonical variates give the best t linear functions.
The sum of all possible $D^2$ values with respect to all the p characters
is equal to the sum of all the roots $\lambda_1 + \cdots + \lambda_p$, and the corresponding
sum for the t largest canonical variates is $\lambda_1 + \lambda_2 + \cdots + \lambda_t$, so that
the adequacy of the fit is judged by the smallness of the sum of the
residual roots $\lambda_{t+1} + \cdots + \lambda_p$ or its ratio to the total.
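A small worked case may make the determinantal equation concrete. The sketch below treats p = 2 characters only, where the equation is a quadratic in the root. The matrices B and Lambda are invented values and the function names `two_by_two_roots`, `canonical_vector`, and `quad` are hypothetical. It finds both roots, builds the vector belonging to the larger root with the dispersion-matrix quadratic form scaled to unity, and computes the share of the total taken by the first root.

```python
import math

def two_by_two_roots(b, lam):
    """Roots of |B - c*Lambda| = 0 for symmetric 2 x 2 matrices, largest first."""
    qa = lam[0][0] * lam[1][1] - lam[0][1] ** 2
    qb = -(b[0][0] * lam[1][1] + b[1][1] * lam[0][0] - 2.0 * b[0][1] * lam[0][1])
    qc = b[0][0] * b[1][1] - b[0][1] ** 2
    disc = math.sqrt(qb * qb - 4.0 * qa * qc)
    return (-qb + disc) / (2.0 * qa), (-qb - disc) / (2.0 * qa)

def quad(vec, mat):
    """Quadratic form vec * mat * vec'."""
    return sum(vec[i] * mat[i][j] * vec[j] for i in range(2) for j in range(2))

def canonical_vector(b, lam, root):
    """Vector l with l(B - root*Lambda) = 0, scaled so that l Lambda l' = 1."""
    m = [[b[i][j] - root * lam[i][j] for j in range(2)] for i in range(2)]
    vec = [m[1][0], -m[0][0]]
    if abs(vec[0]) + abs(vec[1]) < 1e-12:  # first column of m vanished
        vec = [m[1][1], -m[0][1]]
    norm = math.sqrt(quad(vec, lam))
    return [x / norm for x in vec]

# Hypothetical between-groups matrix B and dispersion matrix Lambda.
B = [[4.0, 1.0], [1.0, 2.0]]
LAMBDA = [[1.0, 0.5], [0.5, 1.0]]
root1, root2 = two_by_two_roots(B, LAMBDA)
l = canonical_vector(B, LAMBDA, root1)
share = root1 / (root1 + root2)
print(root1, root2, l, share)
```

Here the value of l B l' equals the larger root, and the share root1 / (root1 + root2) plays the role of the adequacy ratio described above.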


9c.3 An Illustrative Example 

In the example of 9b, the original variables have been already transformed
into an uncorrelated set, so that, if the transformed values (for
which $\Lambda = I$, the unit matrix) are used, the problem reduces to the
determination of the latent roots and vectors of the between product
sum matrix. From the table of mean values of the transformed characters
the between group sum of squares and products are calculated
(Table 9c.3α). It may be observed that the mean values of Table
9b.1δ are so adjusted that their sum for any character over all the groups
is zero, in which case no correction is necessary for the raw sum of
squares and products.

TABLE 9c.3α. Between Product Sum Matrix Using Mean Values of the Transformed Characters

Matrix A

        y1       y2       y3       y4       y5       y6       y7       y8       y9
y1    2.296    0.067   -0.567    0.709   -0.991    0.074   -0.561    0.014    0.682
y2    0.067    0.499    0.113    0.311   -0.158    0.403    0.239    0.708   -0.104
y3   -0.567    0.113    0.417    0.070    0.270    0.314    0.254    0.197   -0.489
y4    0.709    0.311    0.070    1.472   -0.503    0.437    0.192   -0.424   -0.412
y5   -0.991   -0.158    0.270   -0.503    1.295    0.187    0.571   -0.604    0.806
y6    0.074    0.403    0.314    0.437    0.187    0.731    0.447    0.498   -0.020
y7   -0.561    0.239    0.254    0.192    0.571    0.447    1.022    0.488   -0.059
y8    0.014    0.708    0.197   -0.424   -0.604    0.498    0.488    3.075   -1.140
y9    0.682   -0.104   -0.489   -0.412    0.806   -0.020   -0.059   -1.140    2.722

One method of determining the canonical vectors is by iteration. We
start with a trial vector, multiply each row of the matrix with this
vector, and thus obtain a derived vector which is a better approximation
than the trial one. The convergence will be quicker if we start
not with the original matrix but with a suitable power of it (Hotelling,
1936). The canonical vectors associated with a symmetric matrix A
and $A^{2^p}$, a suitable power of A, are the same; the canonical roots of $A^{2^p}$
are those of A raised to the power $2^p$. If l is the vector associated with
the root $\lambda$ of A, then

$$l(A - \lambda I) = 0$$

Multiplying by $(A + \lambda I)$, the equation reduces to

$$l(A^2 - \lambda^2 I) = 0$$

Multiplying by $(A^2 + \lambda^2 I)$,

$$l(A^4 - \lambda^4 I) = 0$$

and so on. Hence the result stated above is true.
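The iteration can be sketched in a few lines. This is an illustration under stated assumptions rather than the computation of the text: the 3 x 3 symmetric matrix is a toy example, the function names `matmul` and `leading_root` are hypothetical, and the matrix is squared three times so that the iteration runs on its eighth power, as in the worked example that follows.

```python
def matmul(a, b):
    """Product of two square matrices held as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def leading_root(a, squarings=3, sweeps=5):
    """Largest root and its vector by iterating on A raised to the power 2**squarings."""
    power = a
    for _ in range(squarings):
        power = matmul(power, power)  # A^2, A^4, A^8, ...
    vec = [1.0] * len(a)              # trial vector (1, ..., 1)
    divisor = 1.0
    for _ in range(sweeps):
        derived = [sum(row[j] * vec[j] for j in range(len(vec))) for row in power]
        divisor = max(derived, key=abs)  # the highest quantity in the set
        vec = [x / divisor for x in derived]
    return abs(divisor) ** (1.0 / 2 ** squarings), vec

# Toy symmetric matrix whose roots are 3 + sqrt(3), 3, and 3 - sqrt(3).
toy = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
root, vector = leading_root(toy)
print(root, vector)
```

The last divisor is the chosen power of the largest root, so taking the corresponding root of it recovers the largest root itself.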

From matrix A of Table 9c.3α first $A^2$, the square of A, is obtained
and then $A^4$ by squaring $A^2$. Finally, by squaring $A^4$, $A^8$ is obtained,
as in Table 9c.3β. Choosing the trial vector (1, ..., 1), the first
approximation is

-2581, 12640, 7558, 4949, -18114, 9445, 7663, 50071, -41284

which are simply the column totals. If we divide by the highest quantity
in the set, the vector reduces to

-0.0515, 0.2524, 0.1509, 0.0988, -0.3618, 0.1886, 0.1530, 1, -0.8245

TABLE 9c.3β. Matrix $A^8$

 19264   -3164   -7381    5938   -4662   -3405   -8659  -17324   16812
 -3164    5954    4195    1770   -8053    4452    4133   24372  -21019
 -7381    4195    4480    -703   -3115    3437    4788   18446  -16589
  5938    1770    -703    3506   -6092     918   -1344    5443   -4487
 -4662   -8053   -3115   -6092   15902   -5395   -2130  -30993   26424
 -3405    4452    3437     918   -5395    3421    3515   18386  -15884
 -8659    4133    4788   -1344   -2130    3515    5313   18398  -16351
-17324   24372   18446    5443  -30993   18386   18398  101184  -87841
 16812  -21019  -16589   -4487   26424  -15884  -16351  -87841   77651


Multiplying each row of $A^8$ by this vector, we derive a second approximation,
and so on; the operations are repeated until stable values are
obtained. After five operations the following vector is obtained.


—0.1849, 0.2400, 0.1865, 0.0507, —0.3008, 0.1816, 0.1855, 1, —0.8755 


The highest value used in the last stage of division is 206,926, and this 
gives the eighth power of the first canonical root. 


$\lambda_1^8 = 206{,}926$ or $\lambda_1 = 4.620$


This vector is standardized by dividing each element by the square 
root of the sum of squares of all the elements. The standardized vector is 


—0.129, 0.167, 0.130, 0.035, —0.210, 0.127, 0.129, 0.698, —0.611 


From the (i, j)th element of the matrix $A^8$ is subtracted the product
$\lambda_1^8$ × ith element × jth element of the first vector to obtain the reduced
matrix given in Table 9c.3γ. The above process of choosing a trial
vector and finding better approximations is repeated on this reduced
matrix.

TABLE 9c.3γ. The Reduced Matrix Eliminating the First Root and the Vector

 15821    1306   -3907    6883  -10266     -22   -5202    1304     503
  1306     151    -315     543    -778      61    -354     190     151
 -3907    -315     976   -1656    2538      24    1302    -346    -138
  6883     543   -1656    3247   -4554     -10   -2292     331     -11
-10266    -778    2538   -4554    6783     109    3494    -678     115
   -22      61      24     -10     109      98     121      88     135
 -5202    -354    1302   -2292    3494     121    1844    -298      17
  1304     190    -346     331    -678      88    -298     413     381
   503     151    -138     -11     115     135      17     381     416
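The reduction step removes the contribution of the root already found. The sketch below shows it on a toy 2 x 2 matrix with roots 3 and 1, using the hypothetical function name `deflate`, and then repeats the arithmetic for the leading element of the reduced matrix using the rounded first element -0.129 of the standardized vector.

```python
import math

def deflate(matrix, root, vector):
    """Subtract root * v_i * v_j from each element, v being scaled to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    unit = [x / length for x in vector]
    n = len(matrix)
    return [[matrix[i][j] - root * unit[i] * unit[j] for j in range(n)] for i in range(n)]

toy = [[2.0, 1.0], [1.0, 2.0]]             # roots 3 and 1
reduced = deflate(toy, 3.0, [1.0, 1.0])    # remove the root 3 and its vector
remaining = reduced[0][0] + reduced[1][1]  # the trace now equals the remaining root

# Leading element of the reduced matrix: 19264 - 206926 * (-0.129)^2.
first_element = 19264 - 206926 * (-0.129) ** 2
print(reduced, remaining, round(first_element))
```

After the reduction the iteration converges to the next largest root, since the first one no longer dominates.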


The second vector is found to be

0.722, 0.062, -0.193, 0.342, -0.508, -0.004, -0.259, 0.058, 0.013

and the second root is

$$\lambda_2 = (27{,}347)^{1/8} = 3.629$$

The first two canonical vectors supply the best two linear functions.

        y1       y2       y3       y4       y5       y6       y7       y8       y9
z1   -0.129    0.167    0.130    0.035   -0.210    0.127    0.129    0.698   -0.611
z2    0.722    0.062   -0.193    0.342   -0.508   -0.004   -0.259    0.058    0.013

From the mean values of $y_1, y_2, \dots$ given in Table 9b.1δ the mean values
of $z_1$, $z_2$ are calculated as shown in Table 9c.3δ.
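Each canonical mean is the sum of products of a coefficient vector with a group's transformed means. As a check on one column, the sketch below (the function name `canonical_mean` is hypothetical) applies the two coefficient vectors to the transformed means of the Brahmin (Basti, B1) group.

```python
# Coefficients of the first two canonical variates on y1, ..., y9.
z1_coef = [-0.129, 0.167, 0.130, 0.035, -0.210, 0.127, 0.129, 0.698, -0.611]
z2_coef = [0.722, 0.062, -0.193, 0.342, -0.508, -0.004, -0.259, 0.058, 0.013]

# Transformed mean values y1, ..., y9 of the Brahmin (Basti, B1) group.
basti_brahmin = [0.534, 0.323, 0.001, 0.158, -0.066, 0.440, -0.003, 0.705, 0.189]

def canonical_mean(coef, means):
    """Mean of a canonical variate: sum of coefficient times transformed mean."""
    return sum(c * m for c, m in zip(coef, means))

z1 = canonical_mean(z1_coef, basti_brahmin)
z2 = canonical_mean(z2_coef, basti_brahmin)
print(round(z1, 3), round(z2, 3))
```

The two values agree with the entries 0.437 and 0.535 given for B1 in the table of canonical variate means.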


TABLE 9c.3δ. Mean Values of the Canonical Variates

       B1      B2      Ch      M       C1      C2      Bh      D       A1      A2      A3      A4
z1    0.437   0.380  -0.859  -0.856   1.088   0.920  -0.410  -0.357   0.372   0.012  -0.129  -0.576
z2    0.535   0.423   0.918   0.405   0.031  -0.133  -1.217  -0.783   0.028   0.070  -0.170  -0.106

The sum of squares for $z_1$ = 4.619 and for $z_2$ = 3.630. They correspond
to the first two canonical roots, thus providing a check on the
calculations. These two roots alone account for a total variation
4.619 + 3.630 = 8.249 out of a total of 9, corresponding to 9 transformed
characters. The percentage of variation absorbed is 91.7, so
that a two-dimensional representation gives a fairly accurate picture
of the configuration of the groups in the nine-dimensional space. The
two-dimensional representation with the canonical variates as coordinate
axes is given in Figure 2. A third root may absorb some more variation,
in which case a three-dimensional representation will supply an almost
true picture of the configuration of the groups.


FIGURE 2. U.P. Anthropometric survey group constellations in the ($\lambda_1$-$\lambda_2$) chart.


9d A Test for Reduction in the Number of Dimensions 


9d.1 The Analysis of Neurotic Cases 

In the previous section no sampling problem was considered, the 
object being to obtain the best method of representation of p-dimen- 
sional data in a smaller number of dimensions. No question was asked 
whether the variation between the groups is entirely confined to a 
smaller space or whether it can be explained by variations with respect 
to a smaller number of hypothetical characters. Such a question would 
require the estimation of these hypothetical characters, if they exist, 
and a test of significance for the residual variation. The algebra is the 
same as in the previous section, but sample sizes come into the calculations
when tests of significance are considered. The method is illustrated
by an example discussed elsewhere by the author and Patrick
Slater, who supplied the material and who also provided a suitable
psychological interpretation of the statistical analysis.

The mean values of scores on 5 types of neurotics and normal cases
and the within dispersion matrix have already been given in Table
8b.9α. For the following analysis we need the between and within
product sum matrices.


TABLE 9d.1α. Analysis of Dispersion

       S.P. Matrix between Groups (5 D.F.)      S.P. Matrix within Groups (250 D.F.)
          A           B           C                A           B           C
A      367.7248    161.7773     76.2389         575.2127     62.8946    118.5423
B      161.7773     76.4732     33.0330          62.8946    151.8666      8.9436
C       76.2389     33.0330     17.7109         118.5423      8.9436    148.7735


A glance at the mean Table 8b.9α shows that the most conspicuous
single difference is between the normal cases and the neurotics, but 
there are also marked differences between the groups of neurotics. The 
five cases of post-traumatic personality change approximate the normal 
cases. Closely similar to one another but distinctly different from the 
normal and the remaining neurotic groups are the hysterias and anxiety 
states. Still further removed from the normal are the obsessional and
psychopathic cases; but between these two groups some differences 
appear to exist. Whereas the obsessionals exhibit more pointers per 
person than the psychopaths, this difference is wholly due to an excess 
of symptoms of inadequacy and shyness; there is less evidence of in- 
stability among them than among the psychopaths. This suggests that, 
in addition to variations in the degree to which the neurotic groups 
differ from normal, a further source of variation may be found and may 
prove useful for differential diagnosis. We may ask the following 
problem. 

Is there sufficient evidence to demonstrate variation between the 
groups in more than one dimension, or can the observed differences 
between them be treated as differences simply in degree? 

The determinantal equation for λ, giving the canonical variances, is


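\[
|\,\text{Between matrix} - \lambda\,\text{Within matrix}\,| = 0
\]

\[
\begin{vmatrix}
367.7248 - 575.2127\lambda & 161.7773 - 62.8946\lambda & 76.2389 - 118.5423\lambda \\
161.7773 - 62.8946\lambda & 76.4732 - 151.8666\lambda & 33.0330 - 8.9436\lambda \\
76.2389 - 118.5423\lambda & 33.0330 - 8.9436\lambda & 17.7109 - 148.7735\lambda
\end{vmatrix} = 0
\]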


372 DISTANCE AND GROUP CONSTELLATIONS 


Various methods have been suggested for obtaining the solutions of this 
equation and the canonical vectors associated with the roots. To apply 
the method described in the previous section it is necessary to multiply 
the determinantal equation by the reciprocal of the dispersion matrix 
given after Table 8b.9α. This is equivalent to multiplying the between
sum of products matrix by the reciprocal of the dispersion matrix and
subtracting 250λ times the unit matrix (250 being the degrees of freedom
of the within matrix). The resulting equation is


\[
\begin{vmatrix}
135.2909 - \mu & 58.6725 & 27.3495 \\
209.8339 & 101.4343 - \mu & 42.7341 \\
7.7001 & 2.6617 & 5.4009 - \mu
\end{vmatrix} = 0
\]


where μ = 250λ. This is of the form already considered. In the above
example it is convenient to expand the determinant because the resulting
equation is a simple cubic,

\[
\mu^3 - 242.1261\mu^2 + 2365.848134\mu - 5455.7616654 = 0
\]

The three roots of this equation are

\[
\mu_1 = 232.0312, \qquad \mu_2 = 6.4481, \qquad \mu_3 = 3.6468
\]


The total variation is 242.1261, out of which the variations absorbed 
by the 3 roots are 95.8%, 2.7%, and 1.5%; so the second and third are 
relatively unimportant. 
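
The roots and their shares of the total can be checked numerically. What
follows is a minimal sketch in Python under stated assumptions, not the
original hand computation: it assumes that all three roots of the cubic
are real and uses the trigonometric solution of the cubic; the function
name real_cubic_roots is illustrative only.

```python
import math


def real_cubic_roots(a, b, c):
    """Roots of x^3 + a*x^2 + b*x + c = 0, assuming all three are real."""
    # Depress the cubic with x = t - a/3, giving t^3 + p*t + q = 0.
    p = b - a * a / 3.0
    q = 2.0 * a ** 3 / 27.0 - a * b / 3.0 + c
    m = 2.0 * math.sqrt(-p / 3.0)
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    arg = max(-1.0, min(1.0, 3.0 * q / (p * m)))
    theta = math.acos(arg) / 3.0
    roots = [m * math.cos(theta - 2.0 * math.pi * k / 3.0) - a / 3.0
             for k in range(3)]
    return sorted(roots, reverse=True)


# Coefficients of the cubic in mu as printed in the text.
roots = real_cubic_roots(-242.1261, 2365.848134, -5455.7616654)
total = sum(roots)                       # equals the trace, 242.1261
shares = [100.0 * r / total for r in roots]

for r, s in zip(roots, shares):
    print(f"root {r:9.4f}   share of total {s:5.1f}%")
```

The sum of the roots reproduces the total variation, so the percentage
absorbed by each root is simply the root divided by that sum.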

Since the dispersion matrix is estimated on a large number of degrees
of freedom, the total variation 242.1261 is approximately distributed
as χ² with p(k − 1) degrees of freedom (the number of characters × the
number of groups minus 1), here equal to 15. This is an alternative test
to the Λ criterion (see 7d.1) for judging the overall group differences.

If the whole variation is concentrated in the first canonical (hypo- 
thetical) variate, then the residual variation 


242.1261 — 232.0312 = 10.0949 


should be attributed to chance. The first canonical variance is
distributed as χ² with (p + k − 2) = 7 degrees of freedom, so that the
residual is a χ² with (15 − 7) = 8 degrees of freedom. The value of
χ² = 10.0949 has more than 20% probability on 8 degrees of freedom,
thus providing no evidence of variation in the other dimensions.
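
The probability quoted for the residual can be checked with a short
computation. The sketch below is an illustration only: it relies on the
closed form of the upper tail of χ² that holds when the number of
degrees of freedom is even (an assumption that suits the 8 degrees of
freedom used here), and the function name is hypothetical.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x) for an even number of degrees of freedom.

    Uses the finite series exp(-x/2) * sum_{j < df/2} (x/2)^j / j!,
    which is exact only when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


residual = 242.1261 - 232.0312      # variation left after the first root
p_value = chi2_upper_tail_even_df(residual, 8)
print(f"residual = {residual:.4f}, upper tail probability = {p_value:.3f}")
```

A tail probability above 0.20 agrees with the statement that the
residual gives no evidence of variation beyond the first dimension.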

If this χ² is significant, we proceed to test whether the first two
canonical variates are sufficient to explain the total variance. In this 
case the significance of the sum of the other variances has to be tested. 


The distribution of the degrees of freedom among the various roots is

\[
p(k - 1) = (p + k - 2) + (p + k - 4) + \cdots
\]


each term being 2 less than the previous one. The degrees of freedom 
for the residual in any case are the total minus the degrees of freedom 
for the roots accounting for the total variation on the specified hypothe- 
sis. This is equivalent to the sum of degrees of freedom of the smallest 
canonical variances added to form the residual. 
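
The partition of the degrees of freedom is easily tabulated. The sketch
below is a small Python illustration; the function name is hypothetical,
and the rule that there are min(p, k − 1) roots to which degrees of
freedom are assigned is an assumption made here, not a statement from
the text.

```python
def root_degrees_of_freedom(p, k):
    """Degrees of freedom attached to successive canonical roots.

    p is the number of characters and k the number of groups; the terms
    are p + k - 2, p + k - 4, ..., each 2 less than the one before.
    Assumption: one term for each of the min(p, k - 1) roots.
    """
    n_roots = min(p, k - 1)
    return [p + k - 2 * i for i in range(1, n_roots + 1)]


dof = root_degrees_of_freedom(3, 6)      # 3 characters, 6 groups
print(dof, "total:", sum(dof))           # the parts add up to p(k - 1)
print("residual d.f. after the first root:", sum(dof[1:]))
```

For the present example the residual after the first root carries
5 + 3 = 8 degrees of freedom, the sum for the two smallest roots.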

The χ² approximation can be slightly improved (Bartlett, 1948) by
using [N − ½(p + k)] logₑ (1 + λ) instead of μ = 250λ itself, where N
is the total sample size for all groups put together. In the present
example:


Root      Term                                 χ²      D.F.
First     [255 − ½(p + k)] logₑ (1 + λ₁)     165.14     7
Second    [255 − ½(p + k)] logₑ (1 + λ₂)       6.35     5
Third     [255 − ½(p + k)] logₑ (1 + λ₃)       3.62     3


The χ² for the residual after the first root is eliminated is 6.35 + 3.62
= 9.97, with 5 + 3 = 8 degrees of freedom. In the previous
approximation the residual χ² is 10.0949, which is very nearly the same.
Although no exact test is known, the two tests described above are
equivalent in large samples.

In the present example only the first root is significant, but in prob- 
lems of this nature it is of practical importance to consider at least 
some of the smaller roots in determining the configuration of the various 
groups. It might happen that the variation in the dimension corre- 
sponding to a smaller root is concentrated among a few of the groups, 
in which case very large samples would be necessary to establish sig- 
nificance. Any noticeable difference between two groups in the mean 
values of the canonical variate corresponding to such a root cannot 
be strictly interpreted as real when the overall test does not establish 
the significance of this root. But this additional analysis may throw 
some light on the prospects of future investigations. In the present 
case the canonical variates corresponding to the first two roots have 
been calculated. The configuration of the various groups is indicated 
in Figure 3. 

The coefficients k₁, k₂, k₃ of the best linear fit or the first canonical
variate are obtained from the equations

\[
\begin{aligned}
(135.2909 - 232.0312)k_1 + 58.6725k_2 + 27.3495k_3 &= 0 \\
209.8339k_1 + (101.4343 - 232.0312)k_2 + 42.7341k_3 &= 0 \\
7.7001k_1 + 2.6617k_2 + (5.4009 - 232.0312)k_3 &= 0
\end{aligned}
\]




The matrix of these equations is obtained by substituting for μ the
maximum root 232.0312 in the matrix of the determinantal equation
for μ. Putting k₃ = 1 arbitrarily, we find that the proportional values
of k₁ and k₂ are k₁ = 18.84886, k₂ = 30.61225.
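
The proportional values of k₁ and k₂ can be reproduced by solving two of
the three equations with k₃ = 1. The sketch below is illustrative only:
the choice of the first two equations and the use of Cramer's rule are
assumptions made here, and the third equation serves merely as a check.
Agreement with the printed values is to be expected only to about three
decimal places, since the equations are nearly dependent.

```python
# Matrix of the determinantal equation for mu, as printed in the text.
M = [
    [135.2909, 58.6725, 27.3495],
    [209.8339, 101.4343, 42.7341],
    [7.7001, 2.6617, 5.4009],
]
mu1 = 232.0312          # largest root
k3 = 1.0                # fixed arbitrarily, as in the text

# First two equations rearranged as a 2 x 2 system in k1 and k2:
#   a*k1 + b*k2 = e
#   c*k1 + d*k2 = f
a, b, e = M[0][0] - mu1, M[0][1], -M[0][2] * k3
c, d, f = M[1][0], M[1][1] - mu1, -M[1][2] * k3

det = a * d - b * c
k1 = (e * d - b * f) / det      # Cramer's rule
k2 = (a * f - e * c) / det

# The third equation was not used, so its residual serves as a check.
residual3 = M[2][0] * k1 + M[2][1] * k2 + (M[2][2] - mu1) * k3

print(f"k1 = {k1:.4f}, k2 = {k2:.4f}, third-equation residual = {residual3:.4f}")
```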

The variance of k₁A + k₂B + k₃C is a₁₁k₁² + a₂₂k₂² + a₃₃k₃² +
2a₁₂k₁k₂ + 2a₁₃k₁k₃ + 2a₂₃k₂k₃, where aᵢⱼ are the elements of the
dispersion matrix within the groups. Using the values of k₁, k₂, k₃
obtained as above, we find the variance as 1697.6281 or the standard
deviation as √1697.6281 = 41.2023. Dividing k₁, k₂, k₃ by this value we
find the standardized best linear function 0.4575A + 0.7430B + 0.0243C.
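
The standardization can be sketched in a few lines of Python. This is an
illustration under one assumption: the dispersion matrix is taken to be
the within sum of products matrix divided by its 250 degrees of freedom.
Small differences from the printed variance in the later digits are to
be expected from rounding.

```python
import math

# Within-groups sum of products matrix (250 D.F.) for the scores A, B, C.
W = [
    [575.2127, 62.8946, 118.5423],
    [62.8946, 151.8666, 8.9436],
    [118.5423, 8.9436, 148.7735],
]
DF_WITHIN = 250
k = [18.84886, 30.61225, 1.0]    # proportional coefficients from the text

# Assumed dispersion matrix: a_ij = W_ij / 250.
dispersion = [[w / DF_WITHIN for w in row] for row in W]

# Variance of k1*A + k2*B + k3*C is the quadratic form k' (a_ij) k.
variance = sum(k[i] * dispersion[i][j] * k[j]
               for i in range(3) for j in range(3))
std_dev = math.sqrt(variance)
standardized = [ki / std_dev for ki in k]

print(f"variance = {variance:.2f}, standard deviation = {std_dev:.4f}")
print("standardized coefficients:", [round(c, 4) for c in standardized])
```

After division by the standard deviation the linear function has unit
variance within groups, which is what makes the mean values of the
canonical variate comparable from group to group.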

Similarly, the standardized linear function for the second dimension 
is 0.4071A — 1.0473B + 0.3292C. The mean values of these variates 
for all the groups are given in Table 9d.1β.


TABLE 9d.1β. Mean Values of the First Two Canonical Variates


Group                    λ₁         λ₂
Normal                 0.3879     0.1637
Personality change     0.7891     0.3604
Anxiety state          2.2249     0.2105
Hysteria               2.3227     0.1120
Psychopathy            3.1339    −0.1115
Obsession              3.3601     0.6204


These values can be used for a pictorial representation of the groups
as shown in Figure 3. In the first dimension the normal group occupies
a position at one end of the scale; the neurotic groups are spread out
towards the other extreme, the small group of cases of post-traumatic
personality change being the only one which is not clearly distinct
from normal. At the other extreme are the psychopathic personalities
and the obsessionals; in terms of this dimension they lie very close
together. The anxiety states and the hysterias are also found to
approximate one another closely, but they lie a considerable distance
from the other groups.

FIGURE 3. The configuration of neurotic groups (horizontal axis λ₁,
vertical axis λ₂).

The preponderance of the part of the total variation between the 
groups which occurs in this dimension is the most striking finding in the 
analysis. That it is the first of the dimensions found, that each of the 
original scores makes a positive contribution to the variation in it, 
and, above all, that in it the normal group appears at one pole and the
various neurotic groups all diverge towards the other: these facts
suggest that it can be identified with the general factor among neurotic
characteristics described by Eliot Slater, Eysenck, and other writers.
Whether this general dimension of neuroticism indicates the existence 
of any unitary psychological trait or whether it is simply a reflection 
of the fact that most neurotic characteristics are non-specific and are 
found with varying degrees of frequency among all neurotic states is a 
controversial question upon which it is unnecessary to enter here.
Mayer Gross, Moore, and Patrick Slater prefer the latter alter- 
native. 

Although the variation in the second dimension is very much smaller,
the arrangement of the groups it discloses invites some psychological
consideration. The equation defining variation in this dimension
contrasts the scores for inadequacy with the scores for instability by
giving them opposite signs. At one extreme is the obsessional group,
which in terms of average scores is the most highly inadequate but not
the most unstable; at the other extreme is the psychopathic group, the
most unstable but not the most inadequate. The psychological picture
presented by this arrangement of the groups is a familiar one:
obsessional cases are notoriously fixed in their habits; psychopaths are
notoriously irresponsible and unreliable. What is surprising is not that
some contrast of this kind was found but that the observations exhibit
so little variation in this respect. But, as the score is based on a
summation of three pointers only, it is more likely that variation in
this dimension is insignificant than that it has been insufficiently
accurately measured.

Example 1. If B and W denote the between and within sum of
product matrices and T the total, then the roots of

\[
|W - yT| = 0
\]

and

\[
|B - \lambda W| = 0
\]

are connected by the relation

\[
y = \frac{1}{1 + \lambda}
\]

and the canonical vectors

\[
l(W - yT) = 0 \quad \text{and} \quad l(B - \lambda W) = 0
\]

are the same.
Example 2. The equation for y can be deduced as follows. Evaluate
the determinant

\[
u(y) = |W - yT|
\]

for y = 0, 1, 2, 3, ··· and construct the equation for y by the method
of finite differences.

[Hint: If u₀, Δu₀, Δ²u₀, ··· denote the forward differences, then

\[
u(y) = u_0 + y\,\Delta u_0 + \frac{y(y-1)}{2!}\,\Delta^2 u_0
       + \frac{y(y-1)(y-2)}{3!}\,\Delta^3 u_0 + \cdots \,]
\]
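Example 3. If y₀ is an approximate solution of

\[
u(y) = u_0 + y\,\Delta u_0 + \frac{y(y-1)}{2!}\,\Delta^2 u_0 + \cdots = 0
\]

then the correction δ to y₀ is given by

\[
-u(y_0) = \delta\left[\Delta u_0 + \frac{2y_0 - 1}{2!}\,\Delta^2 u_0
          + \frac{3y_0^2 - 6y_0 + 2}{3!}\,\Delta^3 u_0 + \cdots\right]
\]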


Example 4. All the solutions for y lie between 0 and 1. [Hint:
y = 1/(1 + λ).]

Examples 1 to 4 supply an alternative method of evaluating the
canonical variances without reducing the determinantal equation
|B − λW| = 0 to the form |A − μI| = 0. It is convenient to
find first the values of y, which all lie between 0 and 1. It is easy
to guess an approximate root and then obtain the correction terms by
iteration.
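
The procedure of Examples 1 to 4 can be sketched numerically for the
neurotic data. The Python below is a minimal illustration, not the
original worksheet: the starting guess y = 0.5, the number of
iterations, and all function names are assumptions made here. It
evaluates u(y) = |W − yT| at y = 0, 1, 2, 3, builds the cubic from the
forward differences, and applies the correction δ repeatedly.

```python
B = [[367.7248, 161.7773, 76.2389],
     [161.7773, 76.4732, 33.0330],
     [76.2389, 33.0330, 17.7109]]
W = [[575.2127, 62.8946, 118.5423],
     [62.8946, 151.8666, 8.9436],
     [118.5423, 8.9436, 148.7735]]
T = [[B[i][j] + W[i][j] for j in range(3)] for i in range(3)]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def u(y):
    return det3([[W[i][j] - y * T[i][j] for j in range(3)] for i in range(3)])


# Leading forward differences of u at y = 0, 1, 2, 3 (u is a cubic in y).
row = [u(y) for y in range(4)]
leading = [row[0]]
while len(row) > 1:
    row = [row[i + 1] - row[i] for i in range(len(row) - 1)]
    leading.append(row[0])
u0, d1, d2, d3 = leading


def newton_poly(y):
    return u0 + y * d1 + y * (y - 1) / 2 * d2 + y * (y - 1) * (y - 2) / 6 * d3


def newton_poly_slope(y):
    return d1 + (2 * y - 1) / 2 * d2 + (3 * y * y - 6 * y + 2) / 6 * d3


y = 0.5                                   # assumed starting guess
for _ in range(25):
    y += -newton_poly(y) / newton_poly_slope(y)   # the correction delta

lam = (1 - y) / y                         # from y = 1/(1 + lambda)
mu = 250 * lam
print(f"y = {y:.6f}, lambda = {lam:.6f}, mu = 250*lambda = {mu:.4f}")
```

Starting below the smallest root in y leads to the largest canonical
variance, since small y corresponds to large λ.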

If in any problem all the roots are needed, they may be obtained by
Graeffe's root-squaring method, which is illustrated below. If the
equation is

\[
a_0 y^p + a_1 y^{p-1} + \cdots + a_p = 0
\]




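then the equation whose roots are squares of the above roots is
computed in the following manner

\[
\begin{array}{lccccc}
 & a_0^2 & 2a_0a_2 & 2a_0a_4 & \cdots & \\
 & & -a_1^2 & -2a_1a_3 & -2a_1a_5 & \cdots \\
 & & & a_2^2 & 2a_2a_4 & 2a_2a_6 \;\cdots \\
\hline
\text{Total} & b_0 & b_1 & b_2 & b_3 & b_4 \;\cdots
\end{array}
\]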


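This gives the first approximation to the square of the roots

\[
y_1^2 = -\frac{b_1}{b_0}, \qquad y_2^2 = -\frac{b_2}{b_1}, \qquad \cdots
\]

Starting from this equation, the first operation is repeated. If c₀, c₁,
··· are the coefficients, then

\[
y_1^4 = -\frac{c_1}{c_0}, \qquad y_2^4 = -\frac{c_2}{c_1}, \qquad \cdots
\]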
By using logarithms, the values y₁, y₂, ··· at any stage are obtained.
The process can be terminated as soon as stable values of y₁, y₂, ···
are obtained. At successive stages the powers of the roots obtained are
2, 4, 8, 16, 32, ···, and the convergence will depend on the separation
of roots in the equation. This method is suitable only when all the
roots are different. In problems of the nature discussed in this chapter,
cases of exactly equal roots rarely occur.
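
Root squaring can be tried on the cubic in μ obtained earlier in this
section. The sketch below is a minimal Python illustration under stated
assumptions: the function names are hypothetical, five squarings are
chosen so that the coefficients stay within floating-point range, and
the roots are taken to be real, positive, and distinct.

```python
def graeffe_step(a):
    """One root-squaring step.

    a holds the coefficients a0, a1, ..., ap of a0*y^p + ... + ap.
    The result holds the coefficients of the equation whose roots are
    the squares of the roots of the input equation.
    """
    n = len(a) - 1
    b = []
    for k in range(n + 1):
        total = (-1) ** k * a[k] * a[k]
        j = 1
        while k - j >= 0 and k + j <= n:
            # Cross products 2*a[k-j]*a[k+j] carry the sign of row k - j.
            total += (-1) ** (k - j) * 2.0 * a[k - j] * a[k + j]
            j += 1
        b.append(total)
    return b


def graeffe_roots(a, squarings=5):
    """Approximate root magnitudes after repeated root squaring."""
    c = [float(x) for x in a]
    for _ in range(squarings):
        c = graeffe_step(c)
    power = 2 ** squarings
    # Ratios of successive coefficients estimate the roots raised to `power`.
    return [(-c[i] / c[i - 1]) ** (1.0 / power) for i in range(1, len(c))]


cubic = [1.0, -242.1261, 2365.848134, -5455.7616654]
estimates = graeffe_roots(cubic, squarings=5)
print([round(r, 4) for r in estimates])
```

Taking the 32nd root here plays the part of the logarithms in hand
computation; with well separated roots a few squarings suffice.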

Example 5. If b is an arbitrary vector, show that

\[
\lim_{n \to \infty} \frac{A^n b}{\lambda_1^n} \propto x_1
\]

where λ₁ is the dominant root and x₁ is the first characteristic vector
arising out of the determinantal equation

\[
|A - \lambda I| = 0
\]


This justifies the iterative method adopted on p. 368. 
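
The iteration can be sketched in Python as follows. This is an
illustration under stated assumptions, not a transcription of the
computation on p. 368: the starting vector (1, 1, 1), the fixed number
of steps, the scaling by the last component, and the function names are
all choices made here. It is applied to the unsymmetric matrix of the
determinantal equation for μ, whose roots are real and well separated.

```python
M = [
    [135.2909, 58.6725, 27.3495],
    [209.8339, 101.4343, 42.7341],
    [7.7001, 2.6617, 5.4009],
]


def mat_vec(m, v):
    """Product of a square matrix with a column vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def dominant_root(m, start, steps=50):
    """Repeated multiplication by m, rescaling so the last component is 1."""
    v = [x / start[-1] for x in start]
    root = None
    for _ in range(steps):
        w = mat_vec(m, v)
        root = w[-1]            # v[-1] is 1, so this estimates the root
        v = [x / root for x in w]
    return root, v


root, vector = dominant_root(M, [1.0, 1.0, 1.0])
print(f"dominant root = {root:.4f}")
print("vector with last component 1:", [round(x, 4) for x in vector])
```

Because the second root is small compared with the first, the
contribution of the other characteristic vectors dies out after very
few multiplications.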
Example 6. Taking an arbitrary vector k, form the matrix P, whose
rows are the vectors

\[
k, \; kA, \; \cdots, \; kA^p
\]

where A is a symmetric matrix of order p. Solve the equations

\[
bP = 0
\]

and show that the expansion of |A − λI| = 0 is

\[
b_p\lambda^p + b_{p-1}\lambda^{p-1} + \cdots + b_0 = 0
\]

where b = (b₀, ···, b_p).
[Hint: First show that A satisfies the equation

\[
b_pA^p + \cdots + b_0I = 0 \,]
\]


References 


Bartlett, M. S. (1947). Multivariate analysis. J.R.S.S. Suppl., 9, 76.

Hotelling, H. (1936). Simplified calculation of principal components.
Psychometrika, 1, 27.

Mahalanobis, P. C. (1936). On the generalized distance in Statistics. Proc. Nat.
Inst. Sc. (India), 12, 49.

Mahalanobis, P. C., D. N. Majumdar, and C. R. Rao (1949). Anthropometric
survey of the United Provinces, 1941: A statistical study. Sankhyā, 9, 90.

Morant, G. M. (1923). A first study of the Tibetan skull. Biom., 13, 176.

Rao, C. R., and Patrick Slater (1949). Multivariate analysis applied to
differences between neurotic groups. British J. Psy., Statistics Section, 2, 17.

Tildesley, M. L. (1921). A first study of the Burmese skull. Biom., 13, 176.


Appendix. 


Miscellaneous Problems 


The following problems are intended to serve as exercises and dis- 
cussion problems. 


1. Distinguish between a mathematical limit and a stochastic limit.
Show that if xₙ → x stochastically then the limiting distribution of xₙ
is the same as the distribution of x.

2. If x is a stochastic variable assuming only non-negative values and
E(x) = t, then

\[
P(x < \lambda t) \ge 1 - \frac{1}{\lambda}
\]


3. Find the general form of the probability law which has the 
median as the maximum likelihood estimator of a parameter. 

4. On the basis of a sample of size n how do you test whether the 
observations have arisen from a rectangular population with an unknown 
range? 

This is equivalent to testing whether the middle (n — 2) observations 
have a rectangular distribution in the range determined by the smallest 
and the biggest observation. 

5. From the joint distribution of x̄ and s² as defined in 2b.1 find the
distribution of x̄/s when μ = 0 and show how this can be used to test
for an assigned value of the coefficient of variation in the population.

Derive a locally most powerful unbiased test for the null hypothesis 
that the coefficient of variation has an assigned value. 

6. Show without actually working out the distribution that, for sam- 
ples from a multivariate normal distribution, the distribution of the 
computed multiple correlation coefficient contains only the corresponding
population parameter and a similar result is true for the partial
correlation coefficient.

7. Prove that

\[
aGa' \le (aTa')^{1/2}\,(aGT^{-1}Ga')^{1/2}
\]

where a is an arbitrary vector and T and G are positive definite matrices.
Make use of the fact that the ratio of the left-hand expression to that
of the right is the correlation between the vectors x = aD and
y = aGD′⁻¹ where DD′ = T.

Hence, following the notions of 8c.2, show that the genetic advance
as determined by the discriminant function is greater than that for the
straight selection function.

8. Calculate the information limits to the variances of unbiased esti- 
mates of pı and pe in the problem of 4c.1 and show that they are actually 
attained for the maximum likelihood estimates. 

9. A coin is tossed a number of times till x heads appear. Find the 
average and the variance of the number of tosses. Determine a recur- 
rence relation for the evaluation of higher moments. 

10. Using the data considered in 7d.5, obtain the function of the 
measurements which characterizes most effectively the secular changes 
in progress. Assume that regressions with time are linear. 

Hint: The probability density for any time t can be written

\[
\text{const.}\,\exp\left\{-\tfrac{1}{2}\sum\sum \lambda^{ij}
(x_i - a_i - b_i t)(x_j - a_j - b_j t)\right\}
\]


The likelihood ratio for two different time points gives the best dis- 
criminating function. In this case it is linear in the measurements. 
Estimate the constants. 

11. Give any method by which the number of fishes in a pond can 
be estimated. 

In an investigation to estimate the number of births in a locality 
during a particular month the following data are obtained: 350 families 
reported no birth and 100 reported a single birth of which 25 took 
place in the government hospital. Knowing the exact number of births 
in the hospital, how do you estimate the total number of births in the 
locality? Find the standard error of this estimate. 

12. Comment on the following anecdote: On examining the chart 
giving the standard heights and associated weights of persons a husband 
remarked that his wife weighs more than she should. The wife retorted 
saying that she is not so tall as she should be. 

13. A total number of 130 heads was observed when 100 rupee coins
and 100 half-rupee coins were thrown together. Knowing that the 
rupee coin is unbiased, what can you say about the half-rupee coin? 

14. Show that under conditions of random mating the blood group 
frequencies in the four classes O, A, B, AB, can be expressed in terms 
of gene frequencies only. 

15. The following data relate to the distribution of age at death in a 
particular year classified according to civil conditions. 


Civil Age at Death 
Condition 0-15 15-30 30-45 45-60 60-75 Over 75 
Bachelor 300 120 50 60 45 10 
Married 10 70 100 120 50 12 


Widowed PT as 12 45 30 20 




What conclusions can you draw about the effect of civil condition on 
the age at death? What more is needed for a proper comparison of 
these distributions? 

16. How do you collect material to study whether the correlation 
coefficient between the measurements of head length and head breadth 
on the skull is greater than the correlation coefficient between the 
corresponding measurements on the living? What is the appropriate 
statistic to test the above null hypothesis if all four measurements are 
available for each of n individuals? 

17. Find on the basis of a sample of size n the best estimate of the 
correlation coefficient in a bivariate normal population when both the 
variables have the same mean and variance. This coefficient is known
as the intra-class correlation. 

Suppose that p measurements are available on each of two brothers 
from a number of families. Determine the best linear compound of the 
measurements which has the highest intra-class correlation coefficient 
between the brothers. 

Obtain the large and small sample distributions of the maximum 
correlation and indicate their use in tests of significance. 

18. The following estimates of the correlation coefficients between 
intelligence test scores were found in an investigation to study the 
relative influences of environmental and heredity factors. 


                          Two Brothers   Two Brothers   Twins      Twins
                          Reared         Living         Reared     Living
                          Apart          Together       Apart      Together
Correlation coefficient   0.235          0.342          0.451      0.513
Sample size               50             40             45         55


Comment on these figures, using tests of significance wherever necessary. 
19. It is said that man’s stature is variable during the day and that 
the range is half an inch, the maximum stature being observable when 
the man gets up in the morning and the minimum when he retires to 
bed. How will you proceed to study this variability? Also comment 
on the precision with which the stature has to be measured for such
studies.

20. To determine the sex ratio three procedures were adopted: (1)
A number of boys were asked to state the number of brothers (including
himself) and sisters they have. (2) A number of girls were asked to
state the number of sisters (including herself) and brothers they have.
(3) Finally a number of parents were asked to give the number of sons




and daughters they have. The totals of brothers, sisters, and families 
for the three procedures are given below. 


                      Procedure
                    1      2      3
Brothers          203    140    120
Sisters           130    220    115
Total families     80     95     60


Test whether the family size is the same in the three investigations. 
Find the estimates of the sex ratio for the three procedures, and test 
whether they are significantly different. 


Index 


(Names occurring in references are indicated by r, and those in footnotes by n.) 


Aitken, A. C., 29r, 127r, 253, 271r 
Analysis of covariance, concomitant 
variation, 119 
Analysis of dispersion, 258 
Analysis of variance, concomitant vari- 
ation, 119 
equality of regression equations, 112 
least square method, 83 
one-way classification, 89 
significance of regression coefficients, 
105 
test for an assigned regression function, 
115 
two-way classification, 91 
unequal numbers in cells, 94 
Anscombe, F. J., 209, 210, 220r 
Anthropological illustrations; asymmetry 
and kurtosis of nasal height dis- 
tributions, 219 
asymmetry of right and left femora, 87 
classification of the Highdown skull, 
291 
cranial capacity, 103 
differences in boys of different schools, 
262 
differences of the first and second born, 
245 
differences in sexing of skulls, 197 
differences in variabilities of males and 
females, 225 
estimation from incomplete data, 161 
femur and humerus differences in 
groups, 252 
group constellations of castes in United 
Provinces, 357 
group constellations using canonical 
variates, 370 
group differences in head breadth, 90 
group differences in nasal height, 94 
inheritance in man, 124 


Anthropological illustrations, regression 
of nasal index on weather factors, 
115 
secular variations in skull characters, 
266 
sexing of osteometric material, 304 
stature of prehistoric man, comments, 
115 
use of indices, 155 
Asymmetric matrix, canonical reduction, 
27 
Asymptotic distribution, of quantiles, 156 
of the median, 157 
of maximum likelihood estimates, 157 


Backcross data, 188 
Barnard, M. M., 266, 267, 271, 271r 
Bartlett, M. S., 201, 220r, 226, 228, 235r, 
236, 259, 261, 271r, 373, 378r 
Bateson, 180, 194 
Bayes’ theorem, 308 
Bernstein, F., 170, 174r 
Beta variable, distribution, 41 
distribution of product, 42 
moments of, 42 
Bhattacharya, A., 148, 174r 
Bias in an estimate, 151 
Binomial proportion, estimation, 134 
sin⁻¹ transformation, 210
Binomial variate, distribution, 32 
distribution function, 33 
moments of, 34 
Biological illustrations, differential thick- 
ness of bark in different directions, 
241 
feeding experiment on pigs, 121 
milk feeding experiment, 217 
see also Anthropological illustrations, 
Genetic illustrations, Psycholog- 
ical illustrations 




Birnbaum, Z. W., 336, 349r 
Bivariate normal distribution, estimation 
of parameters, 161 
moments of, 153 
Bliss, C. I., 255, 256, 271r 
Blood groups, O, A, B, comparison of 
gene frequencies, 187 
consistency test, 184 
estimation of gene frequencies, 169 
Boas, 124 
Bose, R. C., 74r, 80n 
Brogden, H. E., 338, 349r 
Brookner, R. J., 236, 272r 
Canonical reduction, of asymmetric 
matrix, 27 
of symmetric matrix, 24 
Canonical variates, illustrations, 367, 
370 
use in group constellations, 364 
Cauchy population, distribution, 42 
distribution of the mean in samples, 43 
efficient estimate of the parameter, 166 
inconsistent estimate of the parameter, 
152 
Central limit theorem, 174 
Chapman, D. G., 336 
Characteristic vectors, 21 
Chi-square, χ², approximations to distribution of, 222
distribution of least squares, 58 
distribution of sum of, 41 
general large sample theory of χ² test,
179 
goodness of fit test, 183 
limits to χ² in test for specified variance, 223
multivariate tests of, 257, 294, 372 
non-central distribution of, 50, 57 
ratio of, 45 
test of homogeneity, 185 
tests in contingency tables, 192 
Classificatory problems, allocation of 
individuals to groups in a fixed 
ratio, 322 
differential predictors, 337 
discriminant function, 287 
doubtful region, 296 
in genetic selection, 329 
in job selection, 336 




Classificatory problems, with more than 
two groups, 307 
resolution of a mixed series, 300 
single predictor, 336 
with two groups, 286 
uncertainty of the a priori information, 
290 
Cochran’s theorem, 49, 55 
Cochran, W. G., 74r, 255, 256, 271r 
Coefficient of correlation, see Correlation 
coefficient 
Coefficient of racial likeness, C.R.L., 
355 
reduced form, R.C.R.L., 356 
Combination of data from various 
sources, 172 
Combination of tests, Pλ, 44, 217
Concomitant variation, adjustment for, 
119 
Conditional tests, 201, 205 
Confidence interval, 276 
Consistency in estimation, 151 
Consistency relation of moments, in 
population, 27 
in samples, 17 
Contingency tables, with small samples, 
200 
tests of independence, 192 
Convergence theorems, 172, 173 
Correction for grouping, 301 
Correlation coefficient, estimate from 
parallel samples, 234 
multiple, distribution of, 63, 65 
partial, distribution of, 69 
simple, distribution of, 68 
tanh⁻¹ transformation, 231
tests of significance, 230, 232 
Coupling data, 167 
Covariance analysis, concomitant vari- 
ation, 119 
Cramér, H., 158, 172, 174r 
Cranial capacity, estimation of mean, 
110 
prediction formula for, 103 
preservation of skulls with small, 111 


D² (Mahalanobis), applications, 246, 354,
357 
distribution, 70 
generalization to many groups, 257 




Dandekar’s correction for continuity, 203 
see also Yates’ correction, 203 
Deficiency matrix, 3 
Degrees of freedom, of χ², 177, 179
of least squares, 84 
Dependence of vectors, 2 
Determinantal equation, 22 
Determinants, 10 
reduction of, 31 
Discriminant function, construction of, 
247 
difficulties in its use, 289 
in genetic selection, 329 
in job selection problems, 336, 337 
as ratio of likelihoods, 287 
successive evaluation of these func- 
tions, 254 
test for an assigned, 248 
test for equality of functions, 250 
test for its coefficients, 251 
Discriminant score, linear, 316 
Discrimination of neurotic conditions, 
316 
Discriminatory problems, see Classi- 
ficatory problems 
Discriminatory topology, 352 
Dispersion matrix, 52 
Dissection of a compound frequency 
curve, 300 
Distance between two populations, 352 
Mahalanobis’ D², 70, 246, 354, 357
Distribution, of χ² goodness of fit, 177,
179 
of correlation coefficient, 68 
of Fisher’s z, 48 
of Hotelling’s T², 70
of L- and M-statistics for homogeneity
of variances, 226, 227
of Λ-statistic (Wilks), 261
of least squares, 58
of linear functions of normal variates,
53
of Mahalanobis’ D², 70
of maximum likelihood estimates, as- 
ymptotic, 157 
of mean and variance in normal 
samples, 46 
of multiple correlation coefficient, 63, 
65 
of non-central χ², 50




Distribution, of non-null #2, 48 

of Pλ of Pearson, 44

of partial correlation coefficient, 69 

of product of beta variables, 42 

of quadratic forms, 49, 55 

of quantiles, asymptotic, 156 

of regression coefficients, 63 

of square of a normal variable, 40 

of Students’ t, 47 

of U-statistic, comparison of two D²
values, 74

of V-statistic, generalization of D², 257

of variances and covariances (Wishart), 66

of W-statistic, the difference of two D²
values, 255


Distribution of sum of variates, Cauchy, 


43 

gamma, 41 

normal, 39 

Poisson, 36, 37 
Distribution function, binomial, 33 

Poisson, 38 
Doubtful region in classification, 296 
Dugue, 158, 174r 


Efficiency of an estimate, 155 
Elderton, E. M., 220r 
Elementary matrix, 13 
Equations, linear, homogeneous, 5 
nonhomogeneous, 6 
numerical solutions, 31 
Estimation, linear, with correlated vari- 
ables, 81 
intrinsic properties of normal equa- 
tions, 79 
normal equations, 76 
observational equations, 75 
of regression coefficients, 103 
with restrictions on parameters, 81 
standard errors of estimates, 78 
with two sets of parameters, 118 
with weighted observations, 85 
method of maximum likelihood, 150 
combination of data, 172 
consistency and bias, 151 
efficiency, 155 
estimation from incomplete data, 
161 
method of scoring, 165 




Estimation, method of maximum likeli- 
hood, optimum properties of esti- 
mates, 157 

the primitive postulate, 151 

relation to sufficient statistics, 151 

minimum variance, information limit 

to variance, 130 

a lower bound to variance, 143 

the problem of several parameters, 
144 

relation to sufficient statistics, 135, 
139 

Eysenck, 375 


F, f distributions, 45, 48, 49 
Factors of neurosis, 370 
Fairfield Smith, H., 330, 349r 
Fiducial interval, 276 
Fisher, R. A., 74r, 87, 87n, 124, 127r, 
128r, 131, 150, 168, 174r, 175r, 
200, 211, 220r, 231, 236, 237, 247, 
271r, 272r, 276, 287, 288, 289, 301 
Fisher’s z distribution, 48 
Frets, G. P., 245, 272r 
Gamma variate, distribution of sum of 
variates, 41 
estimation of parameters of distri- 
bution of, 134, 141 
moments of, 41 
Gauss, K. S., 76 
Generalized distance, see D²
Generalized variance, 147 
Genetic advance, 332 
Genetic illustrations, comparison of gene 
frequencies, 186 
comparison of recombination fractions, 
188, 211 
consistency of blood group frequen- 
cies, 184 
detection of linkage, 181, 195 
discrimination between Iris versicolor 
and Iris setosa, 248 
estimation of blood group frequencies, 
169 
estimation of linkage, coupling and 
repulsion, 167 
inheritance in man, 124 
selection of poultry by discriminant 
function, 330 




Genetic selection, discriminant function 
in, 329 

Goodwin, C. N., 292, 349r 

Gosset, W. S. (Student), 87n, 128r 

Gray, H., 124, 127r 

Gross, Mayer, 375 

Grouping correction, 301 


Hartley, H. O., 228, 235r 
Homogeneity tests, contingency table, 
185 
of correlations, 233 
of variances, 226 
Hooke, B. G. E., 103, 111, 128r, 292, 349r 
Hotelling, H., 74r, 201, 220r, 236, 272r, 
286, 367, 378r 
Hotelling’s T², 70
Hsu, P. L., 74r, 237, 272r 
Huzurbazar, V. S., 158, 175r
Hypothesis, composite, 176 
linear, 82, 83 
Neyman-Pearson theory, 278 
null, 274 
simple, 176 


Illustrations, see Anthropological, Bio- 
logical, Genetic, and Psychologi- 
cal 

Incomplete data, estimation from, 161 

Index, consistent estimate of, 155 

moments of, 154 

Information, 131 

Information limit, 131 

Intercross data, 180 

Intrinsic properties of normal equations, 
79 

Invariance, under orthogonal transfor- 
mation, 19 

of some statistics, 72 


Jeffreys, H., 275, 349r 


k-statistic, 301 

Kemsley, W. F. F., 225, 235r 

Kintchine’s theorem, 174 

Kolodzieczyk, St., 128r 

Koopman, B. O., 149, 175r 

Kurtosis, test for deviation from normal, 
219 




L test, 226, 227 
Large sample, standard errors of, index, 
154 
median, 157 
moments, 215 
quantiles, 156 
transformed statistics, 207 
Large sample tests, χ² goodness of fit, 179
of deviation from symmetry and nor- 
mal kurtosis, 219 
difference between means, 217 
homogeneity of parallel samples, 185 
homogeneity of Poisson samples, 205 
independence in contingency tables, 
192 
with transformed statistics, 207 
Latent roots of a matrix, 21 
Least squares, in estimation, 80, 81, 82 
evaluation of, 85 
two fundamental distributions in, 58 
Levi, F. W., 29r 
Levy, P., 175r 
Levy’s theorem, 174 
Liapounoff, A., 175r 
Liapounoff’s theorem, 174
Likelihood, 150 
Likelihood ratio test, 226, 259, 285 
Limiting distribution, of binomial vari- 
ate, 38 
of maximum likelihood estimates, 157 
of quantiles, 156 
Limiting theorems, central limit, Lind- 
berg-Levy’s, 174 
convergence of sum, product and ratio, 
173 
Kintchine’s, 174 
Levy’s, 174 
Liapounoff’s, 174 
Slutsky’s, 173 
Linear equations, homogeneous, 5 
nonhomogeneous, 5 
numerical methods of solution, 31 
Linear estimation, see Estimation, linear
Linear hypothesis, nature of, 82 
test of, 83 
Linear transformation, 18 
Lindberg, J. W., 175r 
Lindberg-Levy theorem, 160, 174 


M test, homogeneity of variances, 227 




Macdonell, W. R., 104n, 128r 
Mahalanobis, P. C., 246, 258, 272r, 355, 
357, 378r 
Mahalanobis’ D², see D²
Majumdar, D. N., 318, 357, 378r 
Markoff, A. A., 76, 128r 
Martin, E. S., 349r 
Mather, K., 167, 175r 
Matrix, characteristic vectors of, 21 
the determinant of, 10 
elementary, 13 
latent roots of, 21 
of linear transformations, 18 
numerical computations with, 31 
partitioned, 9 
of a quadratic form, 18 
rank of, 2 
reciprocal, 13 
reduction of an asymmetric form, 27 
sweeping out a, 2 
unit matrix, 8 
Maximum likelihood, see Estimation, 
method of maximum likelihood 
Merril, A. S., 175r 
Minimax, some comments on, 313
Minimum variance, see Estimation, mini- 
mum variance 
Mises, R. V., 312, 349r 
Mixed series, resolution of, 300 
Moments, of beta distribution, 42 
of binomial distribution, 34 
of bivariate normal, 153 
of gamma distribution, 41 
of multinomial distribution, 35 
of normal distribution, 39 
of Poisson distribution, 36 
of moment statistics, 215 
Moore, 375 
Morant, G. M., 292, 349r, 355, 378r 
Multinomial distribution, moments of, 
35 
Multiple correlation, distribution of, 63, 
65 
test for, 104 
Multivariate analysis, of dispersion, 259 
generalization of D², 257 
internal analysis of variates, 264 
of neurotic patients, 370 
problem of secular variations in skull 
characters, 266 




Multivariate analysis, problems of a 
single sample, 239 
problems of two samples, 246 
review of work on, 236 
successive evaluation of discriminant 
functions, 254 
test for additional information due to 
some characters, 252 
test for an assigned contrast of p cor- 
related variables, 243 
test for an assigned discriminant 
function, 248 
test for an assigned ratio of discrimi- 
nant function coefficients, 251 
test for asymmetry, 245 
test of differences in mean values, 262 
test for differences in two samples, 248 
test for equality of discriminant func- 
tions in parallel samples, 251 
test for equality of means of p corre- 
lated variables, 240 
use of the W-statistic, difference of two 
D² values, 255 
Wilks’s criterion, 258 
Multivariate normal population, distri- 
bution of linear functions from, 53 
distribution of quadratic forms in, 55 
variances and covariances in, 52 
Multivariate distributions, Hotelling’s 
T, 70 
Mahalanobis’ D², 70 
multiple correlation, 63, 65 
partial correlation coefficient, 69 
U-statistic, comparison of two D² 
values, 74 
variances and covariances, 66 
Wilks’s Λ criterion, 260 
Munter, A. H., 128r, 163, 175r 


Nair, U. S., 52, 74r, 260, 272r 
Nandi, H. K., 74r 
Neurosis, the discrimination problem of, 
316 
factors of, 370 
Neyman, J., 130, 175r, 224, 235r, 236, 
272r, 277, 278, 280, 282, 285, 349r, 
350r 
Neyman-Pearson lemma, 309, 339 
generalization of, 339 
a slight variation of, 341 




Neyman-Pearson theory of hypothesis 
testing, 278 
Noncentral χ² distribution, 50 
Non-null t distribution, 48 
Normal equations, 76 
intrinsic properties, 79 
Normal population, distribution of mean 
and variance, 46 
distribution of the square, 40 
distribution of sum of variates, 39 
estimation of parameters, 133, 139, 
140, 148, 150 
moments of, 39 
see also Multivariate normal population 
Nuisance parameter, 201 
Null hypothesis, 274 
Null vector, 1 


Observational equations, 75 
see also Estimation, linear 
Orthogonal transformation, 19 

Orthogonal vector space, 3 


Pλ distribution, 44 
application of, 217 
Pairs of quadratic forms, 25 
Panse, V. G., 330, 350r 
Partial correlation coefficient, 69, 365 
Partitioned matrix, 9 
Pearson, E. S., 224, 228, 235r, 236, 272r, 
277, 278, 280, 282, 285, 349r, 350r 
Pearson, K., 33, 38, 74r, 112, 115, 128r, 
227n, 300, 321, 350r, 355 
Pearson’s (Karl) Pλ, 44 
coefficient of racial likeness, 355 
Penrose, L. S., 307, 350r 
Pivotal condensation, 29 
Poisson variate, conditional distribu- 
tions, 36, 37 
distribution function, 38 
distribution of the sum, 36, 37 
estimation of the mean, 135 
large sample tests, 205 
moments of, 36 
Power function, 277 
a lemma on, 342 
Prediction formula, 103 
for cranial capacity, 103 
Psychological illustrations, discrimina- 
tion of neurotic conditions, 316 




Psychological illustrations, factors of 
neurosis, 370 

sociability of village and city recruits, 
202 


speech and physical defects, 199 


Quadratic forms, classification of, 19 
distribution of, 55 
reduction of two forms, 25 

Quantiles, asymptotic distribution, 156 


Range, estimation of, 142 
Rank of a matrix, 2 
Rao, C. R., 74r, 128r, 150, 175r, 220r, 
272r, 316, 350r, 357, 378r 
Reciprocal matrix, 13 
numerical evaluation, 31 
Recombination fraction, 167 
Regression coefficients, distribution of, 
63 
estimation of, 103 
test for, 104 
Residual sum of squares, see Least 
squares 


Scoring method, 165 
combination of data, 172 
in estimating linkage, 167 
gene frequencies, 169 
Similar regions, 286 
Sin⁻¹ transformation, 210 
Slater, Eliot, 375 
Slater, Patrick, 316, 350r, 371, 375, 378r 
Slutsky, E., 175r 
Slutsky’s theorem, 173 
Statistics used in tests of significance, see 
XD, E fb, aM, Ry, 24, U, Vi, 
W, Λ (Wilks’), z 
Student (W. S. Gosset), 87, 87n, 128r 
Student’s distribution, 47 
Student’s generalization of t, 239 
Student’s t test, 87 
Sufficient statistics, 135 
distributions admitting, 137, 149 
an optimum property of, 139 
minimal set of, 144 
Sweep-out method, 2 


t, Student’s, distribution, 47 
non-null distribution, 48 




t, Student’s, test, 87 
T, Hotelling’s, applications, 239 
distribution, 70 
Tanh⁻¹ transformation, 231 
Tests, distance power, 284 
likelihood ratio, 226, 259, 285 
locally most powerful unbiased, 280 
uniformly most powerful, 278 
see also Analysis of variance, Chi 
square, Homogeneity tests, Large 
sample tests, Multivariate analy- 
sis 
Tildesley, M. L., 355, 378r 
Tocher, K. D., 363 
Transformation of statistics, expression 
for variance under, 207 
log of standard deviation, 214 
sin⁻¹ root of binomial proportion, 
210 
square root of Poisson variate, 209 
tanh⁻¹ of correlation coefficient, 214 
Tschebysheff’s lemma, 152 
Turnbull, H. W., 29r 


U-statistic, application, 243, 250, 251, 
253 
distribution, 74 
Unbiased estimation, general, 129 
an important aspect of, 130 
linear, 75 
Unbiased minimum variance, see Esti- 
mation, minimum variance 
Uniformly most powerful test, 278 


V-statistic, generalization of D², 257 


Variance, distribution of, in normal 
samples, 46 
information limit to, in estimation, 
130, 143, 144 


log transformation, 214 
test for a given inequality of variances, 
224 
test for homogeneity of variances, 226 
test for a specified value of, 221 
unbiased minimum variance, estimate 
of, 148 
see also Analysis of variance 
Vectors, linear independence of, 2 
orthogonality of, 2 
Vector space, basis of, 2 




Vector space, orthogonal, 3 
rank of, 2 
Von Mises, R., 312, 349r 


W-statistic, difference of two D² values, 
255 
Wald, A., 160, 175r, 178, 220r, 236, 272r, 
329, 350r 
Wilks, S. S., 236, 259, 260, 272r 
Wilks’ Λ criterion, 258 
Wishart, J., 74r, 121, 128r, 236, 272r 
Wishart’s distribution, 66 


Yates, F., 87n, 168, 175r, 203, 220r 
Yates’ correction for continuity, 203 
see also Dandekar’s correction, 203 


z distribution, 48 




